Preface


While significant changes have been made in the current edition from its predecessor, the
authors have tried to keep the discussion at the same level of accessibility, that is, less
mathematical than the measure theory approach but more rigorous than formula and recipe
manuals.

It has been said that probability is hard to understand, not so much because of its
mathematical underpinnings but because it produces many results that are counterintuitive.
Among practically oriented students, probability has many critics. Foremost among these are
the ones who ask, “What do we need it for?” This criticism is easy to answer because future 
engineers and scientists will come to realize that almost every human endeavor involves 
making decisions in an uncertain or probabilistic environment. This is true for entire fields 
such as insurance, meteorology, urban planning, pharmaceuticals, and many more. Another, 
possibly more potent, criticism is, “What good is probability if the answers it furnishes are 
not certainties but just inferences and likelihoods?” The answer here is that an immense 
amount of good planning and accurate predictions can be done even in the realm of uncer- 
tainty. Moreover, applied probability—often called statistics—does provide near certainties: 
witness the enormous success of political polling and prediction. 

In previous editions, we have trod lightly in the area of statistics and more heavily
in the area of random processes and signal processing. In the electronic version of this book, 
graduate-level signal processing and advanced discussions of random processes are retained, 
along with new material on statistics. In the hard copy version of the book, we have dropped 
the chapters on applications to statistical signal processing and advanced topics in random 
processes, as well as some introductory material on pattern recognition. 

The present edition makes a greater effort to reach students with more expository 
examples and more detailed discussion. We have minimized the use of phrases such as
“it is easy to show...”, “it can be shown...”, “it is easy to see...”, and the like. Also,
we have tried to furnish examples from real-world issues such as the efficacy of drugs, 
the likelihood of contagion, and the odds of winning at gambling, as well as from digital 
communications, networks, and signals. 

The other major change is the addition of two chapters on elementary statistics and its 
applications to real-world problems. The first of these deals with parameter estimation and 
the second with hypothesis testing. Many activities in engineering involve estimating para- 
meters, for example, from estimating the strength of a new concrete formula to estimating 
the amount of signal traffic between computers. Likewise, many engineering activities involve
making decisions in random environments, from deciding whether new drugs are effective to 
deciding the effectiveness of new teaching methods. The origin and applications of standard 
statistical tools such as the t-test, the Chi-square test, and the F-test are presented and 
discussed with detailed examples and end-of-chapter problems. 

Finally, many self-test multiple-choice exams are now available for students at the book 
Web site. These exams were administered to senior undergraduate and graduate students 
at the Illinois Institute of Technology during the tenure of one of the authors who taught 
there from 1988 to 2006. The Web site also includes an extensive set of small MATLAB 
programs that illustrate the concepts of probability. 

In summary then, readers familiar with the 3rd edition will see the following significant
changes: 


• A new chapter on a branch of statistics called parameter estimation with many illustrative
examples;

• A new chapter on a branch of statistics called hypothesis testing with many illustrative
examples;

• A large number of new homework problems of varying degrees of difficulty to test the
student’s mastery of the principles of statistics;

• A large number of self-test, multiple-choice exam questions calibrated to the material
in various chapters, available on the Companion Web site;

• Many additional illustrative examples drawn from real-world situations where the
principles of probability and statistics have useful applications;

• A greater involvement of computers as teaching/learning aids such as (i) graphical
displays of probabilistic phenomena; (ii) MATLAB programs to illustrate probabilistic
concepts; (iii) homework problems requiring the use of MATLAB/Excel to realize
probability and statistical theory;

• Numerous revised discussions, based on student feedback, meant to facilitate the
understanding of difficult concepts.


Henry Stark, IIT 
Professor Emeritus 


John W. Woods, Rensselaer 
Professor 


The publishers would like to thank Dr Murari Mitra and Dr Tamaghna Acharya of 
Bengal Engineering and Science University for reviewing content for the International 
Edition. 


1 Introduction to Probability


1.1 INTRODUCTION: WHY STUDY PROBABILITY? 


One of the most frequent questions posed by beginning students of probability is, “Is 
anything truly random and if so how does one differentiate between the truly random 
and that which, because of a lack of information, is treated as random but really isn’t?” 
First, regarding the question of truly random phenomena, “Do such things exist?” As we 
look with telescopes out into the universe, we see vast arrays of galaxies, stars, and planets 
in apparently random order and position. 

At the other extreme from the cosmic scale is what happens at the atomic level. Our 
friends the physicists speak of such things as the probability of an atomic system being in 
a certain state. The uncertainty principle says that, try as we might, there is a limit to 
the accuracy with which the position and momentum can be simultaneously ascribed to a 
particle. Both quantities are fuzzy and indeterminate. 

Many, including some of our most famous physicists, believe in an essential randomness
of nature. Eugen Merzbacher in his well-known textbook on quantum mechanics [1-1]
writes, 


The probability doctrine of quantum mechanics asserts that the indetermination, of 
which we have just given an example, is a property inherent in nature and not merely a 
profession of our temporary ignorance from which we expect to be relieved by a future 
better and more complete theory. The conventional interpretation thus denies the 
possibility of an ideal theory which would encompass the present quantum mechanics 
but would be free of its supposed defects, the most notorious “imperfection” of quantum
mechanics being the abandonment of strict classical determinism. 


But the issue of determinism versus inherent indeterminism need never even be consid- 
ered when discussing the validity of the probabilistic approach. The fact remains that there 
is, quite literally, a nearly uncountable number of situations where we cannot make any 
categorical deterministic assertion regarding a phenomenon because we cannot measure all 
the contributing elements. Take, for example, predicting the value of the noise current i(t) 
produced by a thermally excited resistor R. Conceivably, we might accurately predict i(t) 
at some instant t in the future if we could keep track, say, of the 107° or so excited electrons 
moving in each other’s magnetic fields and setting up local field pulses that eventually all 
contribute to producing i(t). Such a calculation is quite inconceivable, however, and there- 
fore we use a probabilistic model rather than Maxwell’s equations to deal with resistor noise. 
Similar arguments can be made for predicting the weather, the outcome of tossing a real 
physical coin, the time to failure of a computer, dark current in a CMOS imager, and many 
other situations. Thus, we conclude: Regardless of which position one takes, that is, deter- 
minism versus indeterminism, we are forced to use probabilistic models in the real world 
because we do not know, cannot calculate, or cannot measure all the forces contributing to 
an effect. The forces may be too complicated, too numerous, or too faint. 

Probability is a mathematical model to help us study physical systems in an average 
sense. We have to be able to repeat the experiment many times under the same conditions. 
Probability then tells us how often to expect the various outcomes. Thus, we cannot use 
probability in any meaningful sense to answer questions such as “What is the probability 
that a comet will strike the earth tomorrow?” or “What is the probability that there is life 
on other planets?” The problem here is that we have no data from similar “experiments” 
in the past. 

R. A. Fisher and R. Von Mises, in the first third of the twentieth century, were 
largely responsible for developing the groundwork of modern probability theory. The modern 
axiomatic treatment upon which this book is based is largely the result of the work by Andrei 
N. Kolmogorov [1-2]. 


1.2 THE DIFFERENT KINDS OF PROBABILITY 


There are essentially four kinds of probability. We briefly discuss them here. 


Probability as Intuition 


This kind of probability deals with judgments based on intuition. Thus, “She will probably 
marry him” and “He probably drove too fast” are in this category. Intuitive probability 
can lead to contradictory behavior. Joe is still likely to buy an imported Itsibitsi, world 
famous for its reliability, even though his neighbor Frank has a 19-year-old Buick that has 
never broken down and Joe’s other neighbor, Bill, has his Itsibitsi in the repair shop. Here 
Joe may be behaving “rationally,” going by the statistics and ignoring, so to speak, his
personal observation. On the other hand, Joe will be wary about letting his nine-year-old
daughter Jane swim in the local pond, if Frank reports that Bill thought that he might
have seen an alligator in it. This despite the fact that no one has ever reported seeing 
an alligator in this pond, and countless people have enjoyed swimming in it without ever 
having been bitten by an alligator. To give this example some credibility, assume that the 
pond is in Florida. Here Joe is ignoring the statistics and reacting to, what is essentially, 
a rumor. Why? Possibly because the cost to Joe “just-in-case” there is an alligator in the 
pond would be too high [1-3]. 

People buying lottery tickets intuitively believe that certain number combinations like 
month/day/year of their grandson’s birthday are more likely to win than say, 06-06-06. 
How many people will bet even odds that a coin that has heretofore behaved “fairly,” that
is, in an unbiased fashion, will come up heads on the next toss if it has come up heads in
each of the last seven tosses? Many of us share the belief that the coin has some sort of
memory and that, after seven heads, that coin must “make things right” by coming up with
more tails.

A mathematical theory dealing with intuitive probability was developed by 
B. O. Koopman [1-4]. However, we shall not discuss this subject in this book. 


Probability as the Ratio of Favorable to Total Outcomes 
(Classical Theory) 


In this approach, which is not experimental, the probability of an event is computed a priori†
by counting the number of ways $n_E$ that $E$ can occur and forming the ratio $n_E/n$, where
$n$ is the number of all possible outcomes, that is, the number of all alternatives to $E$ plus
$n_E$. An important notion here is that all outcomes are equally likely. Since equally likely
is really a way of saying equally probable, the reasoning is somewhat circular. Suppose we
throw a pair of unbiased six-sided dice‡ and ask what is the probability of getting a 7. We
partition the outcome space into 36 equally likely outcomes as shown in Table 1.2-1, where
each entry is the sum of the numbers on the two dice.


Table 1.2-1 Outcomes of Throwing Two Dice

                      1st die
2nd die     1    2    3    4    5    6
   1        2    3    4    5    6    7
   2        3    4    5    6    7    8
   3        4    5    6    7    8    9
   4        5    6    7    8    9   10
   5        6    7    8    9   10   11
   6        7    8    9   10   11   12





†A priori means relating to reasoning from self-evident propositions or prior experience. The related
phrase, a posteriori, means relating to reasoning from observed facts.
‡We will always assume that our dice have six sides.
The total number of outcomes is 36 if we keep the dice distinct. The number of ways
of getting a 7 is $n_7 = 6$. Hence

$$P[\text{getting a 7}] = \frac{6}{36} = \frac{1}{6}.$$
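
The counting argument for the two dice can be checked by brute-force enumeration. The following is a minimal Python sketch, not part of the book's own software; the variable names are our own choice, and the dice are kept distinct by using ordered pairs.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes (first die, second die); keeping the dice distinct gives 36.
outcomes = list(product(range(1, 7), repeat=2))

# Favorable outcomes: ordered pairs whose faces sum to 7.
favorable = [pair for pair in outcomes if sum(pair) == 7]

n = len(outcomes)    # total number of outcomes
n_7 = len(favorable)  # number of ways of getting a 7
p_seven = Fraction(n_7, n)  # classical probability n_7 / n as an exact fraction

print(n, n_7, p_seven)  # 36 6 1/6
```

Using `Fraction` keeps the ratio exact, which mirrors the a priori character of the classical definition: nothing here is measured, only counted.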





Example 1.2-1 
(toss a fair coin twice) The possible outcomes are HH, HT, TH, and TT. The probability 
of getting at least one tail T is computed as follows: With E denoting the event of getting 
at least one tail, the event E is the set of outcomes 


E = {HT,TH,TT}. 


Thus, event $E$ occurs whenever the outcome is HT or TH or TT. The number of elements
in $E$ is $n_E = 3$; the number of all outcomes, $n$, is four. Hence

$$P[\text{at least one T}] = \frac{n_E}{n} = \frac{3}{4}.$$
Note that since no physical experimentation is involved, there is no problem in postulating 
an ideal “fair coin.” Effectively, in classical probability every experiment is considered 
“fair.” 
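
The same recipe, count the outcomes in $E$ and divide by the total, can be written once and reused. The sketch below is an illustration under the classical assumption that every listed outcome is equally likely; the function name `classical_probability` is hypothetical and does not come from the text.

```python
from fractions import Fraction
from itertools import product

def classical_probability(outcomes, in_event):
    """Return n_E / n, assuming every outcome in `outcomes` is equally likely."""
    outcomes = list(outcomes)
    n_E = sum(1 for outcome in outcomes if in_event(outcome))
    return Fraction(n_E, len(outcomes))

# Two tosses of an ideal fair coin: HH, HT, TH, TT.
two_tosses = ["".join(t) for t in product("HT", repeat=2)]

# Event E: at least one tail appears.
p_at_least_one_tail = classical_probability(two_tosses, lambda s: "T" in s)

print(two_tosses, p_at_least_one_tail)  # ['HH', 'HT', 'TH', 'TT'] 3/4
```

Passing the event as a predicate keeps the outcome list and the event definition separate, so other events on the same experiment need only a different predicate.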





The classical theory suffers from at least two significant problems: (1) It cannot deal 
with outcomes that are not equally likely; and (2) it cannot handle an infinite number
of outcomes, that is, when $n = \infty$. Nevertheless, in those problems where it is impractical
to actually determine the outcome probabilities by experimentation and where, because of 
symmetry considerations, one can indeed argue equally likely outcomes, the classical theory 
is useful. 

Historically, the classical approach was the predecessor of Richard Von Mises’ [1-6] 
relative frequency approach developed in the 1930s, which we consider next. 


Probability as a Measure of Frequency of Occurrence 


The relative frequency approach to defining the probability of an event E is to perform 
an experiment $n$ times. The number of times that $E$ appears is denoted by $n_E$. Then it is
tempting to define the probability of $E$ occurring by

$$P[E] = \lim_{n \to \infty} \frac{n_E}{n}. \tag{1.2-1}$$


Quite clearly, since $n_E \le n$, we must have $0 \le P[E] \le 1$. One difficulty with this approach
is that we can never perform the experiment an infinite number of times, so we can only
estimate $P[E]$ from a finite number of trials. Secondly, we postulate that $n_E/n$ approaches
a limit as $n$ goes to infinity. But consider flipping a fair coin 1000 times. The likelihood
of getting exactly 500 heads is very small; in fact, if we flipped the coin 10,000 times, the
likelihood of getting exactly 5000 heads is even smaller. As $n \to \infty$, the probability of observing
exactly $n/2$ heads becomes vanishingly small. Yet our intuition demands that $P[\text{head}] = 1/2$
for a fair coin. Suppose we choose a $\delta > 0$; then we shall find experimentally that if the coin
is truly fair, the number of times that


$$\left| \frac{n_E}{n} - \frac{1}{2} \right| > \delta, \tag{1.2-2}$$





as $n$ becomes large, becomes very small. Thus, although it is very unlikely that at any stage
of this experiment, especially when $n$ is large, $n_E/n$ is exactly $1/2$, this ratio will nevertheless
hover around $1/2$, and the number of times it will make significant excursions away from the
vicinity of $1/2$, according to Equation 1.2-2, becomes very small indeed.
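
Both claims can be made quantitative for an ideal fair coin by using the binomial count of head sequences, a tool the text has not yet introduced. The Python sketch below is therefore an illustration under that assumption, and the function names are hypothetical. It computes the chance of exactly $n/2$ heads and the chance that $n_E/n$ lies more than $\delta$ away from $1/2$ after $n$ tosses.

```python
from math import comb

def prob_exactly_half(n):
    """P[exactly n/2 heads in n tosses of a fair coin]; n is assumed even."""
    return comb(n, n // 2) / 2**n

def prob_excursion(n, delta):
    """P[|n_E/n - 1/2| > delta] after n fair tosses, as an exact binomial sum."""
    count = sum(comb(n, k) for k in range(n + 1) if abs(k / n - 0.5) > delta)
    return count / 2**n

# The chance of hitting exactly n/2 heads shrinks as n grows ...
print(prob_exactly_half(1000), prob_exactly_half(10_000))

# ... while the chance of a sizable excursion from 1/2 shrinks much faster.
print(prob_excursion(100, 0.1), prob_excursion(1000, 0.1))
```

The first pair of numbers decreases slowly and the second pair decreases rapidly, which is the distinction drawn around Equation 1.2-2: the ratio almost never equals $1/2$ exactly, yet it rarely strays far from it.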

Despite these problems with the relative frequency definition of probability, the relative 
frequency concept is essential in applying probability theory to the physical world. 





Example 1.2-2 
(random.org) An Internet source of random numbers is RANDOM.ORG, which was founded
by a professor in the School of Computer Science and Statistics at Trinity College, Dublin, 
Ireland. It calculates random digits as a function of atmospheric noise and has passed 
many statistical tests for true randomness. Using one of the site’s free services, we have 
downloaded 10,000 random numbers, each taking on values from 1 to 100 equally likely. We 
have written the MATLAB function RelativeFrequencies() that takes this file of random 
numbers and plots the ratio $n_E/n$ as a function of the trial number $n$ = 1, ..., 10,000. We
can choose the event $E$ to be the occurrence of any one of the 100 numbers. For example, for
$E \triangleq \{\text{occurrence of number 5}\}$, the number $n_E$ counts the number of times 5 has occurred
among the 10,000 numbers up to position $n$. A resulting output plot is shown in Figure 1.2-1,
where we see a general tendency toward convergence to the ideal value of 0.01 = 1/100
for 100 equally likely numbers. An output plot for another number choice 23 is shown in 
Figure 1.2-2 again showing a general tendency to converge to the ideal value here of 0.01. In 
both cases though, we note that the convergence is not exact at any value of n, but rather 
just convergence to a small neighborhood of the ideal value. 
This program is available at this book’s website. 
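
For readers without MATLAB, the running ratio $n_E/n$ can be sketched in a few lines of Python. This is not the book's RelativeFrequencies() program: the function name `relative_frequencies` is our own, and Python's pseudo-random generator is used as a stand-in for the file of atmospheric-noise numbers, which is an assumption made only so that the example is self-contained.

```python
import random

def relative_frequencies(numbers, target):
    """Return the list of n_E/n for n = 1..len(numbers), where E = {target occurs}."""
    ratios = []
    n_E = 0
    for n, value in enumerate(numbers, start=1):
        if value == target:
            n_E += 1
        ratios.append(n_E / n)
    return ratios

# Stand-in data: 10,000 pseudo-random integers, each from 1 to 100 equally likely.
rng = random.Random(0)
numbers = [rng.randint(1, 100) for _ in range(10_000)]

ratios = relative_frequencies(numbers, target=5)
print(ratios[-1])  # final value of n_E/n, expected to lie near 0.01
```

Plotting `ratios` against the trial number would give a curve of the kind shown in Figures 1.2-1 and 1.2-2; with real downloaded data, `numbers` would simply be read from the file instead.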





Probability Based on an Axiomatic Theory 


The axiomatic approach is followed in most modern textbooks on the subject. To develop it 
we must introduce certain ideas, especially those of a random experiment, a sample space, 
and an event. Briefly stated, a random experiment is simply an experiment in which the 
outcomes are nondeterministic, that is, more than one outcome can occur each time the 
experiment is run. Hence the word random in random experiment. The sample space is the 
set of all outcomes of the random experiment. An event is a subset of the sample space that
satisfies certain constraints. For example, we want to be able to calculate the probability for 
each event. Also in the case of noncountable or continuous sample spaces, there are certain 
technical restrictions on what subsets can be called events. An event with only one outcome 
will be called a singleton or elementary event. These notions will be made more precise in 
Sections 1.4 and 1.5. 
Figure 1.2-1 Plot of $n_E/n$ for $E$ = {occurrence of number 5} versus $n$ from atmospheric noise
(from website RANDOM.ORG).

Figure 1.2-2 Plot of $n_E/n$ for $E$ = {occurrence of number 23} versus $n$ from atmospheric noise
(from website RANDOM.ORG).


1.3 MISUSES, MISCALCULATIONS, AND PARADOXES IN PROBABILITY


The misuse of probability and statistics in everyday life is quite common. Many of the 
misuses are illustrated by the following examples. Consider a defendant in a murder trial 
who pleads not guilty to murdering his wife. The defendant has on numerous occasions 
beaten his wife. His lawyer argues that, yes, the defendant has beaten his wife but that 
among men who do so, the probability that one of them will actually murder his wife is 
only 0.001, that is, only one in a thousand. Let us assume that this statement is true. It 
is meant to sway the jury by implying that the fact of beating one’s wife is no indicator 
of murdering one’s wife. Unfortunately, unless the members of the jury have taken a good 
course in probability, they might not be aware that a far more significant question is the 
following: Given that a battered wife is murdered, what is the probability that the husband is 
the murderer? Statistics show that this probability is, in fact, greater than one-half. 

In the 1996 presidential race, Senator Bob Dole’s age became an issue. His opponents 
claimed that a 72-year-old white male has a 27 percent risk of dying in the next five years. 
Thus it was argued, were Bob Dole elected, the probability that he would fail to survive his 
term was greater than one-in-four. The trouble with this argument is that the probability 
of survival, as computed, was not conditioned on additional pertinent facts. As it happens, 
if a 72-year-old male is still in the workforce and, additionally, happens to be rich, then 
taking these additional facts into consideration, the average 73-year-old (the age at which 
Dole would have assumed the presidency) has only a one-in-eight chance of dying in the 
next four years [1-3]. 

Misuse of probability appears frequently in predicting life elsewhere in the universe. 
In his book Probability 1 (Harcourt Brace & Company, 1998), Amir Aczel assures us 
that we can be certain that alien life forms are out there just waiting to be discovered. 
However, in a cogent review of Aczel’s book, John Durant of London’s Imperial College 
writes, 


Statistics are extremely powerful and important, and Aczel is a very clear and capable 
exponent of them. But statistics cannot substitute for empirical knowledge about the 
way the universe behaves. We now have no plausible way of arriving at
robust estimates for the probability of life arriving spontaneously when the conditions 
are right. So, until we either discover extraterrestrial life or understand far more about 
how at least one form of life—terrestrial life—first appeared, we can do little more 
than guess at the likelihood that life exists elsewhere in the universe. And as long as 
we're guessing, we should not dress up our interesting speculations as mathematical 
certainties. 


The computation of probabilities based on relative frequency can lead to paradoxes. An 
excellent example is found in [1-3]. We repeat the example here: 


In a sample of American women between the ages of 35 and 50, 4 out of 100 develop 
breast cancer within a year. Does Mrs. Smith, a 49-year-old American woman, there- 
fore have a 4% chance of getting breast cancer in the next year? There is no answer. 
Suppose that in a sample of women between the ages of 45 and 90, a class to which
Mrs. Smith also belongs, 11 out of 100 develop breast cancer in a year. Are
Mrs. Smith’s chances 4%, or are they 11%? Suppose that her mother had breast cancer, 
and 22 out of 100 women between 45 and 90 whose mothers had the disease will develop 
it. Are her chances 4%, 11%, or 22%? She also smokes, lives in California, had two 
children before the age of 25 and one after 40, is of Greek descent .... What group 
should we compare her with to figure out the “true” odds? You might think, the more 
specific the class, the better—but the more specific the class, the smaller its size and 
the less reliable the frequency. If there were only two people in the world very much 
like Mrs. Smith, and one developed breast cancer, would anyone say that Mrs. Smith’s 
chances are 50%? In the limit, the only class that is truly comparable with Mrs. Smith 
in all her details is the class containing Mrs. Smith herself. But in a class of one 
“relative frequency” makes no sense. 


The previous example should not leave the impression that the study of probability, 
based on relative frequency, is useless. For one, there are a huge number of engineering and 
scientific situations that are not nearly as complex as the case of Mrs. Smith’s likelihood of 
getting cancer. Also, it is true that if we refine the class and thereby reduce the class size, 
our estimate of probability based on relative frequency becomes less stable. But exactly 
how much less stable is deep within the realm of the study of probability and its offspring 
statistics (e.g., see the Law of Large Numbers in Section 4.4). Also, there are many situations 
where the required conditioning, that is, class refinement, is such that the class size is 
sufficiently large for excellent estimates of probability. And finally returning to Mrs. Smith, 
if the class size starts to get too small, then stop adding conditions and learn to live with 
a probability estimate associated with a larger, less refined class. This estimate may be 
sufficient for all kinds of actions, for example, planning screening tests, and the like.


1.4 SETS, FIELDS, AND EVENTS 


A set is a collection of objects, either concrete or abstract. An example of a set is the set of 
all New York residents whose height equals or exceeds 6 feet. A subset of a set is a collection 
that is contained within the larger set. Thus, the set of all New York City residents whose 
height is between 6 and 6½ feet is a subset of the previous set. In probability theory we call
sets events. We are particularly interested in the set of all outcomes of a random experiment
and subsets of this set. We denote the set of all outcomes by $\Omega$, and individual outcomes
by $\zeta$.† The set $\Omega$ is called the sample space of the random experiment. Certain subsets of
$\Omega$, whose probabilities we are interested in, are called events. In particular $\Omega$ itself is called
the certain event and the empty set $\phi$ is called the null event.


Examples of Sample Spaces 


Example 1.4-1
(coin flip) The experiment consists of flipping a coin once. Then $\Omega$ = {H, T}, where H is a
head and T is a tail. 


†Greek letter $\zeta$ is pronounced zeta.


Example 1.4-2
(coin flip twice) The experiment consists of flipping a coin twice. Then $\Omega$ = {HH, HT,
TH, TT}. One of sixteen subsets of $\Omega$ is $E$ = {HH, HT, TH}; it is the event of getting at least
one head in two flips. 


Example 1.4-3
(hair on head) The experiment consists of choosing a person at random and counting the
hairs on his or her head. Then

$$\Omega = \{0, 1, 2, \ldots, 10^7\},$$

that is, the set of all nonnegative integers up to $10^7$, it being assumed that no human head
has more than $10^7$ hairs.


Example 1.4-4 
(couple’s ages) The experiment consists of determining the age to the nearest year of each 
member of a married couple chosen at random. Then with x denoting the age of the man 
and $y$ denoting the age of the woman, $\Omega$ is described by

$\Omega$ = {2-tuples $(x, y)$: $x$ any integer in 10–200; $y$ any integer in 10–200}.

Note that in Example 1.4-4 we have assumed that no human lives beyond 200 years and that
no married person is ever less than ten years old. Similarly, in Example 1.4-1, we assumed
that the coin never lands on edge. If the latter is a possible outcome, it must be included
in $\Omega$ in order for it to denote the set of all outcomes as well as the certain event.


Example 1.4-5
(angle in elastic collision) The experiment consists of observing the angle of deflection of a
nuclear particle in an elastic collision. Then

$$\Omega = \{\theta : -\pi \le \theta \le \pi\}.$$

An example of an event or subset of $\Omega$ is


Example 1.4-6
(electrical power) The experiment consists of measuring the instantaneous power $P$ consumed
by a current-driven resistor. Then

$$\Omega = \{P : P \ge 0\}.$$

Since power cannot be negative, we leave out negative values of $P$ in $\Omega$. A subset of $\Omega$ is
the event E = {P > 107? watts}. 

Note that in Examples 1.4-5 and 1.4-6, the number of elements in $\Omega$ is uncountably infinite.
Therefore, there are an uncountably infinite number of subsets. When, as in Example 1.4-4,
the number of outcomes is finite, the number of distinct subsets is also finite, and each
represents an event. Thus, if $\Omega = \{\zeta_1, \ldots, \zeta_N\}$, the number of possible subsets of $\Omega$ is $2^N$.
We can see this by noting that each element $\zeta_i$ either is or is not present in any given subset
of $\Omega$. This gives rise to $2^N$ distinct subsets or events, including the certain event and the
impossible or null event.
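
The $2^N$ count is easy to confirm for a small sample space by listing every subset. The Python sketch below is only an illustration; the helper name `all_subsets` is hypothetical, and the sample space used is that of Example 1.4-2.

```python
from itertools import chain, combinations

def all_subsets(sample_space):
    """Return every subset of a finite sample space as a frozenset."""
    items = list(sample_space)
    by_size = (combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(c) for c in chain.from_iterable(by_size)]

omega = ["HH", "HT", "TH", "TT"]  # two flips of a coin, N = 4 outcomes
events = all_subsets(omega)

print(len(events))  # 16, that is, 2**4
```

The list includes the empty set (the null event) and the whole of `omega` (the certain event), as well as the event {HH, HT, TH} of getting at least one head.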


Review of set theory. The union (sum) of two sets $E$ and $F$, written $E \cup F$ or $E + F$, is
the set of all elements that are in at least one of the sets $E$ and $F$. Thus, with $E = \{1, 2, 3, 4\}$
and $F = \{1, 3, 4, 5, 6\}$,†

$$E \cup F = \{1, 2, 3, 4, 5, 6\}.$$

If $E$ is a subset of $F$, we indicate this by writing $E \subset F$. Clearly for $E \subset F$ it follows that
$E \cup F = F$. We indicate that $\zeta$ is an element of $\Omega$ or “belongs” to $\Omega$ by writing $\zeta \in \Omega$.
Thus, we can write

$$E \cup F = \{\zeta : \zeta \in E \text{ or } \zeta \in F\}, \tag{1.4-1}$$

where the “or” here is inclusive. Clearly $E \cup F = F \cup E$. The intersection or set product
of two sets $E$ and $F$, written $E \cap F$ or just $EF$, is the set of elements common to both $E$
and $F$. Thus, in the preceding example

$$EF = \{1, 3, 4\}.$$

Formally, $EF \triangleq \{\zeta : \zeta \in E \text{ and } \zeta \in F\} = FE$. The complement of a set $E$, written $E^c$, is
the set of all elements not in $E$. From this it follows that if $\Omega$ is the sample space or, more
generally, the universal set, then

$$E \cup E^c = \Omega. \tag{1.4-2}$$

Also $EE^c = \phi$. The set difference of two sets or, more appropriately, the reduction of $E$ by
$F$, written $E - F$, is the set made up of elements in $E$ that are not in $F$. It should be clear
that

$$E - F \triangleq EF^c,$$
$$F - E \triangleq FE^c,$$

but be careful. Set difference does not behave like difference of numbers; for example,
$F - E - E = F - E$. The exclusive or of two sets, written $E \oplus F$, is the set of all elements
in $E$ or $F$ but not both. It is readily shown that‡

$$E \oplus F = (E - F) \cup (F - E). \tag{1.4-3}$$
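
These operations map directly onto Python's built-in set type, which makes the example sets easy to experiment with. In the sketch below the universal set `omega` = {1, ..., 6} is our own assumption, introduced only so that complements are defined; the sets `E` and `F` are those of the text.

```python
E = {1, 2, 3, 4}
F = {1, 3, 4, 5, 6}
omega = set(range(1, 7))  # assumed universal set for this illustration

union = E | F            # E ∪ F
intersection = E & F     # EF
E_minus_F = E - F        # E - F, elements of E not in F
F_minus_E = F - E        # F - E, elements of F not in E
exclusive_or = E ^ F     # E ⊕ F, elements in E or F but not both
F_complement = omega - F  # F^c relative to omega

print(union, intersection, E_minus_F, F_minus_E, exclusive_or)

# Equation 1.4-3 and the identity E - F = E F^c, checked on these sets.
print(exclusive_or == E_minus_F | F_minus_E, E_minus_F == E & F_complement)

# Set difference is not numeric subtraction: removing E twice changes nothing.
print((F - E) - E == F - E)
```

A check on one pair of sets is of course not a proof, but it is a quick way to catch a misremembered identity.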


†Remember, the order of the elements in a set is not important.

‡Equation 1.4-3 shows why $\cup$ is preferable to $+$ to indicate union. The beginning student might, in
error, write $(E - F) + (F - E) = E - F + F - E = 0$, which is meaningless. Note also that $F + F \neq 2F$,
which is also meaningless. In fact $F + F = F$. So, only use the $+$ and $-$ operators in set theory with care.

Figure 1.4-1 Venn diagrams for set operations. Panels: $E \cup F$, $EF$, $E^c$, $E - F$, $F - E$, $E \oplus F$.


The operation of unions, intersections, and so forth can be illustrated by Venn diagrams, 
which are useful as aids in reasoning and in establishing probability relations. The various 
set operations $E \cup F$, $EF$, $E^c$, $E - F$, $F - E$, and $E \oplus F$ are shown in Figure 1.4-1 in hatch
lines.

Two sets $E$, $F$ are said to be disjoint if $EF = \phi$; that is, they have no elements in
common. Given any set $E$, an $n$-partition of $E$ consists of a sequence of sets $E_i$, where
$i = 1, \ldots, n$, such that $E_i \subset E$, $\bigcup_{i=1}^{n} E_i = E$, and $E_i E_j = \phi$ for all $i \neq j$. Thus, given two
sets $E$, $F$, a 2-partition of $F$ is

$$F = FE \cup FE^c. \tag{1.4-4}$$


It is easy to see, using Venn diagrams, the following results: 


$$(E \cup F)^c = E^c F^c \tag{1.4-5}$$
$$(EF)^c = E^c \cup F^c \tag{1.4-6}$$

and, by induction,† given sets $E_1, \ldots, E_n$:

$$\left[ \bigcup_{i=1}^{n} E_i \right]^c = \bigcap_{i=1}^{n} E_i^c \tag{1.4-7}$$

$$\left[ \bigcap_{i=1}^{n} E_i \right]^c = \bigcup_{i=1}^{n} E_i^c. \tag{1.4-8}$$

†See Section A.4 in Appendix A for the meaning of mathematical induction.


The relations are known as De Morgan’s laws after the English mathematician Augustus 
De Morgan (1806-1871). 
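
De Morgan's laws for several sets can be spot-checked numerically. The Python sketch below uses a hypothetical universal set of the integers 0 to 9 and three arbitrarily chosen subsets; these choices, and the helper name `complement`, are ours and serve only to illustrate Equations 1.4-7 and 1.4-8.

```python
from functools import reduce

omega = frozenset(range(10))  # hypothetical universal set

def complement(s):
    """Complement of s relative to omega."""
    return omega - s

sets = [frozenset({0, 1, 2, 3}), frozenset({2, 3, 4, 5}), frozenset({3, 5, 7, 9})]
complements = [complement(s) for s in sets]

# Equation 1.4-7: complement of the union equals intersection of the complements.
lhs_union = complement(reduce(frozenset.union, sets))
rhs_union = reduce(frozenset.intersection, complements)

# Equation 1.4-8: complement of the intersection equals union of the complements.
lhs_inter = complement(reduce(frozenset.intersection, sets))
rhs_inter = reduce(frozenset.union, complements)

print(sorted(lhs_union), lhs_union == rhs_union)
print(sorted(lhs_inter), lhs_inter == rhs_inter)
```

As with a Venn diagram, agreement on one example illustrates the laws without proving them.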

While Venn diagrams allow us to visualize these results, they do not really constitute a
proof. Toward this end, consider the mathematical definition of equality of two sets. Two
sets $E$ and $F$ are said to be equal if every element in $E$ is in $F$ and vice versa. Equivalently,

$$E = F \quad \text{if } E \subset F \text{ and } F \subset E. \tag{1.4-9}$$


Example 1.4-7 
(proving equality of sets) If we want to strictly prove one of the above set equalities, say
Eq. 1.4-4, $F = FE \cup FE^c$, we must proceed as follows. First show $F \subset FE \cup FE^c$ and then
show $F \supset FE \cup FE^c$. To show $F \subset FE \cup FE^c$, we consider an arbitrary element $\zeta \in F$;
then $\zeta$ must be in either $FE$ or $FE^c$ for any set $E$, and thus $\zeta$ must be in $FE \cup FE^c$. This
establishes that $F \subset FE \cup FE^c$. Going the other way, to show $F \supset FE \cup FE^c$, we start
with an arbitrary element $\zeta \in FE \cup FE^c$. It must be that $\zeta$ belongs to either $FE$ or $FE^c$,
and so $\zeta$ must belong to $F$, thus establishing $F \supset FE \cup FE^c$. Since we have shown the two
set inclusions, we may write $F = FE \cup FE^c$, meaning that both sets are equal.








Using this method you can establish the following helpful laws of set theory: 


1. associative law for union

$$A \cup (B \cup C) = (A \cup B) \cup C$$

2. associative law for intersection

$$A(BC) = (AB)C$$

3. distributive law for union

$$A \cup (BC) = (A \cup B)(A \cup C)$$

4. distributive law for intersection

$$A(B \cup C) = (AB) \cup (AC)$$


We will use these identities or laws for analyzing set equations below. However, these 
four laws must be proven first. Here, as an example, we give the proof of 1. 


Example 1.4-8 
(proof of associative law for union) We wish to prove $A \cup (B \cup C) = (A \cup B) \cup C$. To do this
we must show that both $A \cup (B \cup C) \subset (A \cup B) \cup C$ and $A \cup (B \cup C) \supset (A \cup B) \cup C$. Starting
with the former, assume that $\zeta \in A \cup (B \cup C)$; then it follows that $\zeta$ is in $A$ or in $B \cup C$. But
then $\zeta$ is in $A$ or $B$ or $C$, so it is in $A \cup B$ or in $C$, which is the same as saying $\zeta \in (A \cup B) \cup C$.
To complete the proof, we must go the other way and show, starting with $\zeta \in (A \cup B) \cup C$,
that $\zeta$ must also be an element of $A \cup (B \cup C)$. This part is left for the student.
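
The element-by-element proofs above can be complemented by an exhaustive check on a small universal set. The following Python sketch is only a sanity check under stated assumptions, not a proof: the three-element set `OMEGA` and the helper names `all_subsets`, `complement`, and `laws_hold` are hypothetical choices made for illustration. It tests De Morgan's laws (Eqs. 1.4-5 and 1.4-6) and laws 1 through 4 for every choice of subsets.

```python
from itertools import combinations

OMEGA = frozenset({1, 2, 3})  # small hypothetical universal set


def all_subsets(universe):
    """Return every subset of `universe` as a frozenset."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def complement(e):
    """Complement of e relative to OMEGA."""
    return OMEGA - e


def laws_hold():
    """Check De Morgan's laws and laws 1-4 for every choice of A, B, C."""
    subsets = all_subsets(OMEGA)
    for a in subsets:
        for b in subsets:
            # De Morgan's laws, Eqs. 1.4-5 and 1.4-6
            if complement(a | b) != complement(a) & complement(b):
                return False
            if complement(a & b) != complement(a) | complement(b):
                return False
            for c in subsets:
                if a | (b | c) != (a | b) | c:        # law 1
                    return False
                if a & (b & c) != (a & b) & c:        # law 2
                    return False
                if a | (b & c) != (a | b) & (a | c):  # law 3
                    return False
                if a & (b | c) != (a & b) | (a & c):  # law 4
                    return False
    return True


print(laws_hold())
```

A check of this kind can expose a mistyped identity quickly, but it says nothing about sets outside the chosen universe, which is why the two-inclusion argument is still needed.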

Sigma fields. Consider a universal set $\Omega$ and a certain collection of subsets of $\Omega$. Let $E$
and $F$ be two arbitrary subsets in this collection. This collection of subsets forms a field
$\mathcal{M}$ if


(1) $\phi \in \mathcal{M}$, $\Omega \in \mathcal{M}$.
(2) If $E \in \mathcal{M}$ and $F \in \mathcal{M}$, then $E \cup F \in \mathcal{M}$, and $EF \in \mathcal{M}$.†
(3) If $E \in \mathcal{M}$, then $E^c \in \mathcal{M}$.


We will need to consider fields of sets (fields of events in probability) in order to avoid 
some problems. If our collection of events were not a field, then we could define a probability 
for some events, but not for their complement; that is, we could not define the probability 
that these events do not occur! Similarly we need to be able to consider the probability of 
the union of any two events, that is, the probability that either ‘or both’ of the two events 
occur. Thus, for probability theory, we need to have a probability assigned to all the events 
in the field. 

Many times we will have to consider an infinite set of outcomes and events. In that
case we need to extend our definition of field. A sigma ($\sigma$) field‡ $\mathcal{F}$ is a field that is
closed under any countable number of unions, intersections, and complementations. Thus,
if $E_1, \ldots, E_n, \ldots$ belong to $\mathcal{F}$, so do


$$\bigcup_{i=1}^{\infty} E_i \quad \text{and} \quad \bigcap_{i=1}^{\infty} E_i,$$


where these are simply defined as


$$\bigcup_{i=1}^{\infty} E_i \triangleq \{\text{the set of all elements in at least one } E_i\}$$


and


$$\bigcap_{i=1}^{\infty} E_i \triangleq \{\text{the set of all elements in every } E_i\}.$$

Note that these two infinite operations of union and intersection would be meaningless 
without a specific definition. Unlike infinite summations which are defined by limiting oper- 
ations with numbers, there are no such limiting operations defined on sets, hence the need 
for a definition. 


Events. Consider a probability experiment with sample space $\Omega$. If $\Omega$ has a countable
number of elements, then every subset of $\Omega$ may be assigned a probability in a way consistent
with the axioms given in the next section. Then the class of all subsets will make up a field
or $\sigma$-field simply because every subset is included. This collection of all the subsets of $\Omega$ is
called the largest $\sigma$-field.


†From this it follows by mathematical induction that if $E_1, \ldots, E_n$ belong to $\mathcal{M}$, so do $\bigcup_{i=1}^{n} E_i \in \mathcal{M}$
and $\bigcap_{i=1}^{n} E_i \in \mathcal{M}$.
‡Also sometimes called a $\sigma$-algebra.

Sometimes, though, we do not have enough probability information to assign a probability
to every subset. In that case we need to define a smaller field of events that is still a $\sigma$-field,
but just a smaller one. We will discuss this matter in the example below. Going to the limit,
if we have no probability information, then we must content ourselves with the smallest
$\sigma$-field of events consisting of just the null event $\phi$ and the certain event $\Omega$. While this
collection of two events is a $\sigma$-field, it is not very useful.


Example 1.4-9
(field generated by two events) Assume we have interest in only two events $A$ and $B$ in
an arbitrary sample space $\Omega$ and we desire to find the smallest field containing these two
events. We can proceed as follows. First we generate a disjoint decomposition of the sample
space as follows.


$$\begin{aligned}
\Omega &= \Omega (A \cup A^c)(B \cup B^c) \\
&= AB \cup AB^c \cup A^cB \cup A^cB^c.
\end{aligned}$$


Next generate a collection of events from these four basic disjoint (non-overlapping) events
as follows: The first four events are $AB$, $AB^c$, $A^cB$, and $A^cB^c$. Then we add the pairwise
unions of these disjoint events, such as $AB \cup AB^c$, $AB \cup A^cB$, and $AB \cup A^cB^c$. Finally we add
the unions of triples of these four disjoint events. Counting also the null event $\phi$ (none of the
four included) and $\Omega$ (all four included), the total number of events will then be
$2 \times 2 \times 2 \times 2 = 2^4 = 16$, since each of the four basic disjoint events can be included, or not,
in the event.

This collection of events is guaranteed to be a field, since we construct each of its
16 events from the four basic disjoint events, thus ensuring that complements are in the
collection via $\Omega = AB \cup AB^c \cup A^cB \cup A^cB^c$. Unions are trivially in the collection too. Because
all the events in the collection are built up from the four disjoint events, complements are
just the events that have been left out, e.g., $(AB \cup AB^c)^c = A^cB \cup A^cB^c$, which is recognized
as being in the collection. Hence we have a field. In fact this is the smallest field that
contains the events $A$ and $B$. We call this the field generated by events $A$ and $B$. Can you
show that event $A$ is in this field?
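
The construction in this example can be carried out mechanically. The sketch below is one possible illustration, assuming a hypothetical twelve-element sample space and hypothetical events `A` and `B`; the function names `generated_field` and `is_field` are likewise made up for this sketch. It forms the four basic disjoint events, takes every union of them, and then checks the field properties directly.

```python
from itertools import combinations

OMEGA = frozenset(range(1, 13))   # hypothetical sample space {1, ..., 12}
A = frozenset(range(1, 7))        # hypothetical event A = {1, ..., 6}
B = frozenset(range(3, 10))       # hypothetical event B = {3, ..., 9}


def generated_field(omega, a, b):
    """Return the field generated by events a and b as a set of frozensets."""
    ac, bc = omega - a, omega - b
    # The four basic disjoint events; empty ones are dropped.
    atoms = [s for s in (a & b, a & bc, ac & b, ac & bc) if s]
    field = set()
    for r in range(len(atoms) + 1):
        for chosen in combinations(atoms, r):
            field.add(frozenset().union(*chosen))  # r = 0 gives the null event
    return field


def is_field(field, omega):
    """Check the three defining properties of a field of subsets of omega."""
    if frozenset() not in field or omega not in field:
        return False
    for e in field:
        if omega - e not in field:
            return False
        for f in field:
            if e | f not in field or e & f not in field:
                return False
    return True


FIELD = generated_field(OMEGA, A, B)
print(len(FIELD), is_field(FIELD, OMEGA), A in FIELD)
```

With these particular choices all four basic events are nonempty, so the collection has 16 members; if one of the basic events were empty, the same code would return a smaller field.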





When $\Omega$ is not countable, for example, when $\Omega = R^1$ = the real line, advanced mathematics
(measure theory) has found that not every subset of $\Omega$ can be assigned a probability
(is measurable) in a way that will be consistent. So we must content ourselves with smaller
collections of subsets of the universal event $\Omega$ that form a $\sigma$-field. On the real line $R^1$, for
example, we can generate a $\sigma$-field from all the intervals, open/closed, and this is called the
Borel field of events on the real line. As a practical matter, it has been found that the Borel
field on the real line includes all subsets of engineering and scientific interest.†

At this stage of our development, we have two of the three objects required for the
axiomatic theory of probability, namely, a sample space $\Omega$ of outcomes $\zeta$, and a $\sigma$-field $\mathcal{F}$ of
events defined on $\Omega$. We still need a probability measure $P$. The three objects $(\Omega, \mathcal{F}, P)$ form
a triple called the probability space $\mathcal{P}$ that will constitute our mathematical model. However,
the probability measure $P$ must satisfy the following three axioms due to Kolmogorov.


†For two-dimensional Euclidean sample spaces, the Borel field of events would be subsets of $R^1 \times R^1 = R^2$;
for three-dimensional sample spaces, it would be subsets of $R^1 \times R^1 \times R^1 = R^3$.


1.5 AXIOMATIC DEFINITION OF PROBABILITY


Probability is a set function $P[\cdot]$ that assigns to every event $E \in \mathcal{F}$ a number $P[E]$ called
the probability of event $E$ such that


(1) $P[E] \geq 0$. (1.5-1)
(2) $P[\Omega] = 1$. (1.5-2)
(3) $P[E \cup F] = P[E] + P[F]$ if $EF = \phi$. (1.5-3)


The probability measure is not like an ordinary function. It does not take numbers for 
its argument, but rather it takes sets; that is, it is a measure of sets, our mathematical 
model for events. Since this is a special function, to distinguish it we will always use square 
brackets for its argument, a set of outcomes $\zeta$ in the sample space $\Omega$.

These three axioms are sufficient† to establish the following basic results, all but one of
which we leave as exercises for the reader. Let $E$ and $F$ be events contained in $\mathcal{F}$; then


(4) $P[\phi] = 0$. (1.5-4)
(5) $P[EF^c] = P[E] - P[EF]$. (1.5-5)
(6) $P[E] = 1 - P[E^c]$. (1.5-6)
(7) $P[E \cup F] = P[E] + P[F] - P[EF]$. (1.5-7)


From Axiom 3 we can establish by mathematical induction that


$$P\left[\bigcup_{i=1}^{n} E_i\right] = \sum_{i=1}^{n} P[E_i] \quad \text{if } E_i E_j = \phi \text{ for all } i \neq j. \tag{1.5-8}$$

From this result and Equation 1.5-7, we can establish by induction the general result
that $P\left[\bigcup_{i=1}^{n} E_i\right] \leq \sum_{i=1}^{n} P[E_i]$. This result is sometimes known as the union bound, often
used in digital communications theory to provide an upper bound on the probability of
error.


Example 1.5-1 
(probability of the union of two events) We wish to prove result (7). First we decompose
the event $E \cup F$ into three disjoint events as follows:


$$E \cup F = EF^c \cup E^cF \cup EF.$$


By Axiom 3

$$\begin{aligned}
P[E \cup F] &= P[EF^c \cup E^cF] + P[EF] \\
&= P[EF^c] + P[E^cF] + P[EF], \quad \text{by Axiom 3 again} \\
&= P[E] - P[EF] + P[F] - P[EF] + P[EF] \\
&= P[E] + P[F] - P[EF].
\end{aligned} \tag{1.5-9}$$


†A fourth axiom, $P\left[\bigcup_{i=1}^{\infty} E_i\right] = \sum_{i=1}^{\infty} P[E_i]$ if $E_i E_j = \phi$ for all $i \neq j$, must be included to enable one
to deal rigorously with limits and countable unions. This axiom is of no concern to us here but will be in
later chapters.

We can apply this result to the following problem.
In a certain bread store, there are two events of interest: $W \triangleq$ {white bread is available}
and $R \triangleq$ {rye bread is available}. Based on past experience with this establishment, we
take $P[W] = 0.8$ and $P[R] = 0.7$. We also know that the probability that both breads are
present is 0.6, that is, $P[WR] = 0.6$. We now ask what is the probability of either bread
being present, that is, what is $P[W \cup R]$? The answer is obtained from basic result (7) as


$$\begin{aligned}
P[W \cup R] &= P[W] + P[R] - P[WR] \\
&= 0.8 + 0.7 - 0.6 \\
&= 0.9.
\end{aligned}$$
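
The basic results (4) through (7) can be checked numerically on any finite sample space. The sketch below is an illustration under assumptions: the fair-die weights, the events `E` and `F`, and the helper `make_measure` are hypothetical, and the measure is defined by summing outcome probabilities so that additivity holds by construction. The bread-store calculation is repeated at the end with exact fractions.

```python
from fractions import Fraction


def make_measure(weights):
    """Build P[.] on a finite sample space from outcome probabilities."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("outcome probabilities must be nonnegative")
    if sum(weights.values()) != 1:
        raise ValueError("outcome probabilities must sum to one (Axiom 2)")

    def P(event):
        # Add the probabilities of the outcomes contained in the event.
        return sum((weights[z] for z in event), Fraction(0))

    return P


OMEGA = frozenset(range(1, 7))                        # hypothetical fair die
P = make_measure({z: Fraction(1, 6) for z in OMEGA})
E = frozenset({1, 2, 3})                              # hypothetical events
F = frozenset({2, 3, 4, 5})

results = {
    4: P(frozenset()) == 0,
    5: P(E - F) == P(E) - P(E & F),
    6: P(E) == 1 - P(OMEGA - E),
    7: P(E | F) == P(E) + P(F) - P(E & F),
}
print(results)

# Bread-store numbers from the text, kept exact with fractions.
p_w_or_r = Fraction(8, 10) + Fraction(7, 10) - Fraction(6, 10)
print(p_w_or_r)
```

Using `Fraction` rather than floating point keeps the equality tests exact; with floats, 0.8 + 0.7 - 0.6 need not compare equal to 0.9.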





We pause for a bit of terminology. We say an event E occurs whenever the outcome 
of our experiment is one of the elements in $E$. So “$P[E]$” is read as “the probability that
event E occurs.” 


A measure of events not outcomes. The reader will have noticed that we talk of 
the probability of events rather than the probability of outcomes. For finite and countable 
sample spaces, we could just as well talk about probabilities of outcomes; however, we do 
not do so for several reasons. One is we would still need to talk about probabilities of events 
and so would need two types of probability measures, one for outcomes and one for events. 
Second, in some cases we only know the probability of some events and don’t know the 
probabilistic detail to assign a probability to each outcome. Lastly, and most importantly,
in the case of continuous sample spaces with uncountably many outcomes, for example the points
on the real number interval [0, 5], each individual outcome may well have zero probability, and hence any
theory based on probability of the outcomes would be useless. For these and other reasons we base
our approach on events, and so probability measures events, not outcomes.


Example 1.5-2 
(toss coin once) The experiment consists of throwing a coin once. Our idealized outcomes
are then H and T, with sample space:


$$\Omega = \{H, T\}.$$


The $\sigma$-field of events consists of the following sets: $\{H\}$, $\{T\}$, $\Omega$, $\phi$. With the coin assumed
fair, we have†


$$P[\{H\}] = P[\{T\}] = \tfrac{1}{2}, \quad P[\Omega] = 1, \quad P[\phi] = 0.$$


Example 1.5-3 
(toss die once) The experiment consists of throwing a die once. The outcomes are the
number of dots $\zeta = 1, \ldots, 6$, appearing on the upward facing side of the die. The sample











space $\Omega$ is given by $\Omega = \{1, 2, 3, 4, 5, 6\}$. The event field consists of $2^6$ events, each one
containing, or not, each of the outcomes $i$. Some events are


$$\phi,\ \Omega,\ \{1\},\ \{1, 2\},\ \{1, 2, 3\},\ \{1, 4, 6\},\ \text{and } \{1, 2, 4, 5\}.$$

We assign probabilities to the elementary or singleton events $\{\zeta_i\}$:


$$P[\{\zeta_i\}] = \tfrac{1}{6}, \quad i = 1, \ldots, 6.$$


All probabilities can now be computed from the basic axioms and the assumed probabilities
for the elementary events. For example, with $A = \{1\}$ and $B = \{2, 3\}$ we obtain $P[A] = \frac{1}{6}$.
Also $P[A \cup B] = P[A] + P[B]$, since $AB = \phi$. Furthermore, $P[B] = P[\{2\}] + P[\{3\}] = \frac{2}{6}$ so
that

$$P[A \cup B] = \tfrac{1}{6} + \tfrac{2}{6} = \tfrac{1}{2}.$$


†Remember the outcome $\zeta$ is the output or result of our experiment. The set of all outcomes is the
sample space $\Omega$.

Example 1.5-4 
(choose ball from urn) The experiment consists of picking at random a numbered ball from
12 balls numbered 1 to 12 in an urn. Our idealized outcomes are then the numbers $\zeta = 1$
to 12, with sample space:


$$\Omega = \{1, \ldots, 12\}.$$

Let the following events be specified

$$A = \{1, \ldots, 6\}, \quad B = \{3, \ldots, 9\}$$
$$A \cup B = \{1, \ldots, 9\}, \quad AB = \{3, 4, 5, 6\}, \quad AB^c = \{1, 2\}$$
$$B^c = \{1, 2, 10, 11, 12\}, \quad A^c = \{7, \ldots, 12\}, \quad A^cB^c = \{10, 11, 12\}$$
$$(AB)^c = \{1, 2, 7, 8, 9, 10, 11, 12\}.$$


Hence
$$P[A] = P[\{1\}] + P[\{2\}] + \ldots + P[\{6\}],$$
$$P[B] = P[\{3\}] + \ldots + P[\{9\}],$$
$$P[AB] = P[\{3\}] + \ldots + P[\{6\}].$$
If $P[\{1\}] = \ldots = P[\{12\}] = \frac{1}{12}$, then $P[A] = \frac{1}{2}$, $P[B] = \frac{7}{12}$, $P[AB] = \frac{1}{3}$, and so forth.
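
The set computations and probabilities of this example are easy to reproduce. The short Python sketch below assumes the equally likely case $P[\{\zeta\}] = 1/12$ stated at the end of the example; the helper name `prob` is a hypothetical choice.

```python
from fractions import Fraction

OMEGA = set(range(1, 13))   # balls numbered 1 to 12
A = set(range(1, 7))        # A = {1, ..., 6}
B = set(range(3, 10))       # B = {3, ..., 9}


def prob(event):
    """Probability of an event when all 12 outcomes are equally likely."""
    return Fraction(len(event), len(OMEGA))


derived = {
    "A u B": A | B,
    "AB": A & B,
    "AB^c": A - B,
    "B^c": OMEGA - B,
    "A^c": OMEGA - A,
    "A^c B^c": OMEGA - (A | B),
    "(AB)^c": OMEGA - (A & B),
}
for name, event in derived.items():
    print(name, sorted(event), prob(event))

print(prob(A), prob(B), prob(A & B))
```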





We point out that a theory of probability could be developed from a slightly different set of 
axioms [1-7]. However, whatever axioms are used and whatever theory is developed, for it 
to be useful in solving problems in the physical world, it must model our empirical concept 
of probability as a relative frequency and the consequences that follow from it. 


†We say event $A$ occurs whenever any of the numbers 1 through 6 appears on a ball removed from the
urn.

Figure 1.5-1 Partitioning $\bigcup_{i=1}^{3} E_i$ into seven disjoint regions $\Delta_1, \ldots, \Delta_7$.


Probability of union of events. The extension of Equation 1.5-7 to the case of three
events is straightforward but somewhat tedious. We consider three events, $E_1$, $E_2$, $E_3$, and
wish to compute the probability, $P\left[\bigcup_{i=1}^{3} E_i\right]$, that at least one of these events occurs. From
the Venn diagram in Figure 1.5-1, we see seven disjoint regions in $\bigcup_{i=1}^{3} E_i$ which we label
as $\Delta_i$, $i = 1, \ldots, 7$. You can prove it using the same method used in Example 1.4-9. Then
$P\left[\bigcup_{i=1}^{3} E_i\right] = P\left[\bigcup_{i=1}^{7} \Delta_i\right] = \sum_{i=1}^{7} P[\Delta_i]$, from Axiom 3.

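In terms of the original events, the seven disjoint regions can be identified as


$$\begin{aligned}
\Delta_1 &= E_1 E_2^c E_3^c = E_1 (E_2 \cup E_3)^c, \\
\Delta_2 &= E_1 E_2 E_3^c, \\
\Delta_3 &= E_1^c E_2 E_3^c = E_2 (E_1 \cup E_3)^c, \\
\Delta_4 &= E_1 E_2 E_3, \\
\Delta_5 &= E_1 E_2^c E_3, \\
\Delta_6 &= E_1^c E_2 E_3, \\
\Delta_7 &= E_1^c E_2^c E_3 = E_3 (E_1 \cup E_2)^c.
\end{aligned}$$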


The computations of the probabilities $P[\Delta_i]$, $i = 1, \ldots, 7$, follow from Equations 1.5-5
and 1.5-7. Thus, we compute


$$\begin{aligned}
P[\Delta_1] &= P[E_1] - P[E_1 E_2 \cup E_1 E_3] \\
&= P[E_1] - \{P[E_1 E_2] + P[E_1 E_3] - P[E_1 E_2 E_3]\}.
\end{aligned}$$
In obtaining the first line, we used Equation 1.5-5. In obtaining the second line, we used
Equation 1.5-7. The computations of $P[\Delta_i]$, $i = 3, 7$, are quite similar to the computation of
$P[\Delta_1]$ and involve the same sequence of steps. Thus,
$$P[\Delta_3] = P[E_2] - \{P[E_1 E_2] + P[E_2 E_3] - P[E_1 E_2 E_3]\},$$
$$P[\Delta_7] = P[E_3] - \{P[E_1 E_3] + P[E_2 E_3] - P[E_1 E_2 E_3]\}.$$
The computations of $P[\Delta_2]$, $P[\Delta_5]$, and $P[\Delta_6]$ are also quite similar and involve applying
Equation 1.5-5. Thus,
$$P[\Delta_2] = P[E_1 E_2] - P[E_1 E_2 E_3],$$
$$P[\Delta_5] = P[E_1 E_3] - P[E_1 E_2 E_3],$$
$$P[\Delta_6] = P[E_2 E_3] - P[E_1 E_2 E_3],$$
and finally,
$$P[\Delta_4] = P[E_1 E_2 E_3].$$


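Now, recalling that $P\left[\bigcup_{i=1}^{3} E_i\right] = \sum_{i=1}^{7} P[\Delta_i]$, we merely add all the $P[\Delta_i]$ to obtain the
desired result. This gives


$$P\left[\bigcup_{i=1}^{3} E_i\right] = \sum_{i=1}^{3} P[E_i] - (P[E_1 E_2] + P[E_1 E_3] + P[E_2 E_3]) + P[E_1 E_2 E_3]. \tag{1.5-10}$$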


Note that this result makes sense because in adding the measures of the three events we
have counted the double overlaps twice each. But if we subtract these three overlaps, we
have not counted $E_1 E_2 E_3$ at all, and so must add it back in. If we adopt the notation
$P_i \triangleq P[E_i]$, $P_{ij} \triangleq P[E_i E_j]$, and $P_{ijk} \triangleq P[E_i E_j E_k]$, where $1 \leq i < j < k \leq 3$, we can
rewrite Equation 1.5-10 as


$$P\left[\bigcup_{i=1}^{3} E_i\right] = \sum_{i=1}^{3} P_i - \sum_{1 \leq i < j \leq 3} P_{ij} + \sum_{1 \leq i < j < k \leq 3} P_{ijk}.$$


The last sum contains only one term, namely $P_{123}$. Denote now each sum by the symbol $S_l$,
where the $l$ denotes the number of subscripts associated with the terms in that sum. Then


$$P\left[\bigcup_{i=1}^{3} E_i\right] = S_1 - S_2 + S_3, \quad \text{where } S_1 \triangleq \sum_{i=1}^{3} P_i, \quad S_2 \triangleq \sum_{1 \leq i < j \leq 3} P_{ij}, \quad \text{and} \quad S_3 \triangleq \sum_{1 \leq i < j < k \leq 3} P_{ijk}.$$


Why this introduction of new notation? Using the symbols $S_l$, $l = 1, \ldots$, we can extend
Equation 1.5-10 to the general case.


Theorem 1.5-1 (probability of union of n events) The probability $P$ that at least
one among the events $E_1, E_2, \ldots, E_n$ occurs in a given experiment is given by


$$P = S_1 - S_2 + \ldots \pm S_n,$$
where $S_1 \triangleq \sum_{i=1}^{n} P_i$, $S_2 \triangleq \sum_{1 \leq i < j \leq n} P_{ij}$, $\ldots$, $S_n \triangleq \sum_{1 \leq i < j < k < \ldots < l \leq n} P_{ijk \ldots l}$. The last sum
has $n$ subscripts and contains only one term. ■


The proof of this theorem is given in [1-8, p. 89]. It can also be proved by induction;
that is, assume that $P = S_1 - S_2 + \ldots \pm S_n$ is true. Then show that for the case $n + 1$,
$P = S_1 - S_2 + \ldots \pm S_{n+1}$. We leave this exercise for the braver reader.
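
Theorem 1.5-1 translates directly into a short computation. The following sketch is one way to evaluate $S_1 - S_2 + \ldots \pm S_n$ on a finite sample space and compare it with the probability of the union computed directly. The twelve equally likely outcomes, the three events, and the names `prob` and `union_probability` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 13))            # hypothetical equally likely outcomes
EVENTS = [frozenset(range(1, 7)),          # hypothetical E_1
          frozenset(range(3, 10)),         # hypothetical E_2
          frozenset({5, 6, 7, 10})]        # hypothetical E_3


def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(OMEGA))


def union_probability(events):
    """Evaluate S_1 - S_2 + ... +/- S_n from Theorem 1.5-1."""
    total = Fraction(0)
    for l in range(1, len(events) + 1):
        # S_l sums the joint probabilities over all groups of l events.
        s_l = sum((prob(frozenset.intersection(*group))
                   for group in combinations(events, l)), Fraction(0))
        total += s_l if l % 2 == 1 else -s_l
    return total


direct = prob(frozenset().union(*EVENTS))
print(union_probability(EVENTS), direct)
```

The number of terms grows as $2^n - 1$, so this direct evaluation is practical only for a modest number of events.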


1.6 JOINT, CONDITIONAL, AND TOTAL PROBABILITIES; INDEPENDENCE 


Assume that we perform the following experiment: We are in a certain U.S. city and wish 
to collect weather data about it. In particular we are interested in three events, call them 
A, B, and C, where 


A is the event that on any particular day, the temperature equals or exceeds 10°C; 
B is the event that on any particular day, the amount of precipitation equals or exceeds 
5 millimeters; 


C is the event that on any particular day A and B both occur, that is, $C \triangleq AB$.


Since $C$ is an event, we can compute $P[C] = P[AB]$ and we call $P[AB]$ the joint probability
of the events $A$ and $B$. This notion can obviously be extended to more than two events;
that is, $P[EFG]$ is the joint probability of events $E$, $F$, and $G$.† Now let $n_E$ denote the
number of days on which event $E$ occurred. Over a thousand-day period ($n = 1000$), the
following observations are made: $n_A = 811$, $n_B = 306$, $n_{AB} = 283$. By the relative frequency
interpretation of probability


$$P[A] \simeq \frac{n_A}{n} = \frac{811}{1000} = 0.811,$$
$$P[B] \simeq \frac{n_B}{n} = 0.306,$$
$$P[AB] \simeq \frac{n_{AB}}{n} = 0.283.$$


Consider now the ratio $n_{AB}/n_A$. This would be the relative frequency with which event $AB$
occurs given that event $A$ occurs. Put into words, it is the fraction of the days on which the
temperature equals or exceeds 10°C for which the amount of precipitation also equals or
exceeds 5 millimeters. Thus, we are dealing with the frequency of an event given that, or
conditioned upon the fact that, another event has occurred. Note that


$$\frac{n_{AB}}{n_A} = \frac{n_{AB}/n}{n_A/n} \simeq \frac{P[AB]}{P[A]}. \tag{1.6-1}$$


This empirical concept suggests that we introduce in our theory a conditional probability 
measure. 


†$E$, $F$, $G$ are any three events defined on the same probability space.

Conditional probability. The conditional probability $P[B|A]$ is defined by


$$P[B|A] \triangleq \frac{P[AB]}{P[A]}, \quad \text{if } P[A] > 0, \tag{1.6-2}$$
and is read as “the probability that event B occurs given that event A has occurred.”
Similarly we have

$$P[A|B] \triangleq \frac{P[AB]}{P[B]}, \quad \text{if } P[B] > 0. \tag{1.6-3}$$
Definitions 1.6-2 and 1.6-3 can be used to compute the joint probability of $AB$ since

$$\begin{aligned}
P[AB] &= P[A|B]P[B] \\
&= P[B|A]P[A].
\end{aligned}$$
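
As a numerical illustration of Equations 1.6-2 and 1.6-3, the sketch below applies them to the relative-frequency estimates from the weather data quoted earlier ($n = 1000$, $n_A = 811$, $n_B = 306$, $n_{AB} = 283$). The variable names are hypothetical, and the results are estimates based on relative frequencies rather than exact probabilities.

```python
from fractions import Fraction

n, n_A, n_B, n_AB = 1000, 811, 306, 283   # day counts quoted in the text

p_A = Fraction(n_A, n)
p_B = Fraction(n_B, n)
p_AB = Fraction(n_AB, n)

p_B_given_A = p_AB / p_A    # Eq. 1.6-2, equals n_AB / n_A
p_A_given_B = p_AB / p_B    # Eq. 1.6-3, equals n_AB / n_B

print(float(p_B_given_A), float(p_A_given_B))

# Both factorizations recover the same joint probability.
print(p_A_given_B * p_B == p_AB, p_B_given_A * p_A == p_AB)
```

The two conditional estimates are roughly 0.35 and 0.92, which shows that $P[B|A]$ and $P[A|B]$ can differ greatly even though they share the same joint probability.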


Independence. 


Definitions (independence of events) (i) Two events $A \in \mathcal{F}$, $B \in \mathcal{F}$ with $P[A] > 0$,
$P[B] > 0$ are said to be independent if and only if (iff)


$$P[AB] = P[A]P[B]. \tag{1.6-4}$$


Since, in general, $P[AB] = P[B|A]P[A] = P[A|B]P[B]$, it follows that for independent
events
$$P[A|B] = P[A], \tag{1.6-5a}$$
$$P[B|A] = P[B]. \tag{1.6-5b}$$


Thus, the definition satisfies our intuition: If $A$ and $B$ are independent, the occurrence of $B$
should have no effect on the conditional probability of $A$ and vice versa.

(ii) Three events $A$, $B$, $C$ defined on $\mathcal{F}$ and having nonzero probabilities are said to be
jointly independent iff


$$P[ABC] = P[A]P[B]P[C], \tag{1.6-6a}$$
$$P[AB] = P[A]P[B], \tag{1.6-6b}$$
$$P[AC] = P[A]P[C], \tag{1.6-6c}$$
$$P[BC] = P[B]P[C]. \tag{1.6-6d}$$


This is an extension of (i) above and suggests the pattern for the definition of $n$ independent
events $A_1, \ldots, A_n$. Note that it is not sufficient to have just $P[ABC] = P[A]P[B]P[C]$.
Pairwise independence must also be shown.
(iii) Let $A_i$, $i = 1, \ldots, n$, be $n$ events contained in $\mathcal{F}$. The $\{A_i\}$ are said to be jointly
independent iff

$$P[A_i A_j] = P[A_i]P[A_j]$$
$$P[A_i A_j A_k] = P[A_i]P[A_j]P[A_k]$$
$$\vdots$$
$$P[A_1 \ldots A_n] = P[A_1]P[A_2] \ldots P[A_n]$$
for all combinations of indices such that $1 \leq i < j < k < \ldots \leq n$. ■


Example 1.6-1 
(Sic bo) The game Sic bo is played in gambling casinos. Players bet on the outcome of a 
throw of three dice. Many bets are possible each with a different payoff. We list two of them 
below with the associated payoffs in parentheses: 





(1) Specified three of a kind (180 to 1), that is, pre-specified by the bettor; 
(2) Unspecified three of a kind (30 to 1), that is, any three-way match. 


What are the associated probabilities of winning from the bettor’s point of view, and
what is his expected gain?


Solution 


(1) (specified three of a kind) Let $E_i$ be the event that the specified outcome appears on
the $i$th toss. Then the event that three of a kind appear is $E_1 E_2 E_3$ with probability
$P[E_1 E_2 E_3] = P[E_1]P[E_2]P[E_3] = 1/216$, where we have used the fact that the three
events are independent since they refer to different tosses. A fair payout would thus
be 216 to 1, not 180 to 1.

(2) (unspecified three of a kind) On the first throw any number can come up. On the
next two throws, numbers that match the first throw must come up. Hence $P[\text{three
unspecified}] = 1 \times 1/6 \times 1/6 = 1/36$. A fair payout is thus 36 to 1, not 30 to 1.
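
Both probabilities can be confirmed by enumerating the 216 equally likely outcomes of three dice. The sketch below also evaluates an expected gain per unit staked, under the assumption (not stated in the example) that a payoff of "k to 1" means a win pays k units and a loss forfeits the one-unit stake. The names `prob` and `expected_gain` and the choice of face 4 as the specified number are hypothetical.

```python
from fractions import Fraction
from itertools import product

THROWS = list(product(range(1, 7), repeat=3))   # 216 equally likely outcomes


def prob(event):
    """Probability of the set of throws for which event(throw) is True."""
    return Fraction(sum(1 for t in THROWS if event(t)), len(THROWS))


def expected_gain(p_win, payoff):
    """Expected gain per unit staked, assuming a win pays `payoff` units
    and a loss forfeits the 1-unit stake (an assumed payout convention)."""
    return payoff * p_win - (1 - p_win)


p_specified = prob(lambda t: t == (4, 4, 4))          # bettor names face 4 (arbitrary)
p_unspecified = prob(lambda t: t[0] == t[1] == t[2])  # any three-way match

print(p_specified, expected_gain(p_specified, 180))
print(p_unspecified, expected_gain(p_unspecified, 30))
```

Under that payout convention both expected gains come out negative, consistent with the observation that the posted payoffs are below the fair ones.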





Example 1.6-2 
(testing three events for independence) An urn contains 10 numbered black balls (some 
even, some odd) and 20 numbered white balls (some even, some odd). Some of the balls 
of each color are lighter in weight than the others. The exact composition of the urn is 
shown in the tree diagram of Figure 1.6-1. The outcomes are triples $\zeta = $ (color, weight,
number). The sample space Q is the collection of all these triples. Each draw is completely 
random. 

Let A denote the event of picking a black ball, B denote the event of picking a light 
ball, and C denote the event of picking an even-numbered ball. Are A, B, C independent 
events? 


Solution We first test whether $P[ABC] = P[A]P[B]P[C]$. Now $P[A] = 1/3$ since 1/3 of
the balls are black, $P[B] = 1/2$ since from the tree diagram we see that 15/30ths of the
balls are light, and $P[C] = 2/5$ since 12/30 balls are even numbered. Now $P[ABC] = 2/30$
since the event $ABC$ is black, light, and even and there are only two of them. Multiplying
out we find that $P[ABC] = P[A]P[B]P[C]$. So the three events pass this part of the test
for independence. However, for (full) independence, we must also have $P[AB] = P[A]P[B]$,
$P[AC] = P[A]P[C]$, and $P[BC] = P[B]P[C]$. Note that $P[AC] = 2/30$ while $P[A]P[C] =
1/3 \times 12/30 = 2/15 \neq 2/30$. Hence $A$, $B$, and $C$ are not jointly independent.

Figure 1.6-1 Diagram of composition of urn.
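
The full test for joint independence checks the product rule for every group of two or more events. Since the exact urn composition is given only in Figure 1.6-1, the sketch below uses a different, purely hypothetical sample space of eight equally likely outcomes. It shows one triple that satisfies $P[ABC] = P[A]P[B]P[C]$ yet fails a pairwise condition, and one triple that passes every condition. The function names are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(8))   # hypothetical: 8 equally likely outcomes


def prob(event):
    return Fraction(len(event), len(OMEGA))


def jointly_independent(events):
    """Check the product rule for every group of 2 or more events."""
    for r in range(2, len(events) + 1):
        for group in combinations(events, r):
            joint = prob(frozenset.intersection(*group))
            expected = Fraction(1)
            for e in group:
                expected *= prob(e)
            if joint != expected:
                return False
    return True


# Triple product holds (1/8 = 1/2 * 1/2 * 1/2), but P[AB] = 1/2, not 1/4.
A = frozenset({1, 2, 3, 4})
B = frozenset({1, 2, 3, 4})
C = frozenset({1, 5, 6, 7})

# Events defined by the three binary digits of the outcome pass every test.
D1 = frozenset({4, 5, 6, 7})
D2 = frozenset({2, 3, 6, 7})
D3 = frozenset({1, 3, 5, 7})

print(jointly_independent([A, B, C]), jointly_independent([D1, D2, D3]))
```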





Compound Experiments 


Often we need to consider compound experiments or repeated trials. If we have a probability 
space defined for the individual experiments, we would like to see what this implies for the 
complete or compound experiment. There are two cases to consider, to model the physical 
fact that often the repeated trials seem to be independent of one another, while in other 
important cases the outcome seems to depend on the prior outcomes of earlier trials. 


Independent experiments. Consider two independent experiments, meaning that the 
outcome of one is not affected by past, present, or future outcomes of the other. Let each 
have its own sample space $\Omega$, outcomes $\zeta$, events $E$, and probability measure $P$. Specifically,
we have


$\zeta_1 \in E_1 \subset \Omega_1$ with measure $P_1$ and $\zeta_2 \in E_2 \subset \Omega_2$ with measure $P_2$,


as illustrated in Figure 1.6-2.
We want to be able to work with compound experiments, meaning that the sample
space of the compound experiment is the Cartesian product of the two sample spaces,


$$\Omega \triangleq \Omega_1 \times \Omega_2$$
with vector outcomes (elements) $\zeta = (\zeta_1, \zeta_2) \in E \subset \Omega$.


Example 1.6-3 
(flip two coins) Let two experiments each consist of flipping a two-sided coin, with the two
sides denoted H and T. Then we have $\Omega_1 = \{H, T\} = \Omega_2$. In the compound experiment,
we have $\Omega = \{(H, H), (H, T), (T, H), (T, T)\}$. We could also just as well write the outcomes
$\zeta \in \Omega$ as strings of characters H and T rather than vectors. In that notation, we have
$\Omega = \{HH, HT, TH, TT\}$. Considering event $E_1 = \{T\}$ in the first experiment, and event
$E_2 = \{H\}$ in the second experiment, we have the event $E = \{TH\} = E_1 \times E_2 \subset \Omega$ in the
compound experiment.


Figure 1.6-2 Two compound probabilistic experiments.

When we write a set with cross-product notation, we mean 


$$E_1 \times E_2 \triangleq \{\zeta = (\zeta_1, \zeta_2) \mid \zeta_1 \in E_1 \text{ and } \zeta_2 \in E_2\}.$$


So the elements in the cross-product of two sets are all the possible ordered pairs of elements, 
one from each set. 


Example 1.6-4 
(toss two dice) Let the two experiments now each consist of tossing a die, with the six
faces (up) being denoted as outcomes 1-6. Then we have $\Omega_1 = \{1, 2, 3, 4, 5, 6\} = \Omega_2$. In
the compound experiment, we have as outcomes the pair (or vector) elements of the cross-product
sample space $\Omega = \{11, 12, \ldots, 16, 21, \ldots, 26, \ldots, 61, \ldots, 66\} = \Omega_1 \times \Omega_2$. Note that now
not all events (subsets of $\Omega$) are of the form $E_1 \times E_2$. In fact this is a special case. Consider
the event $\{11, 12, 31\}$, for example. It is missing the outcome 32 contained in $\{1, 3\} \times \{1, 2\}$.
However, we can write this event as a disjoint union over set cross products


$$\{11, 12, 31\} = \{1\} \times \{1, 2\} \cup \{3\} \times \{1\}.$$
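
The decomposition in this example can be checked with ordered pairs. The following sketch represents the outcome 31 as the pair `(3, 1)` and uses two hypothetical helpers, `cross` and `is_cross_product`, to confirm that the event $\{11, 12, 31\}$ is not itself a cross product but is a disjoint union of two cross products.

```python
from itertools import product


def cross(e1, e2):
    """Cross product E1 x E2 as a set of ordered pairs."""
    return set(product(e1, e2))


def is_cross_product(event):
    """True if event equals (its first coordinates) x (its second coordinates)."""
    firsts = {z1 for z1, _ in event}
    seconds = {z2 for _, z2 in event}
    return event == cross(firsts, seconds)


event = {(1, 1), (1, 2), (3, 1)}     # the event {11, 12, 31}
part_a = cross({1}, {1, 2})          # {1} x {1, 2}
part_b = cross({3}, {1})             # {3} x {1}

print(is_cross_product(event))                 # (3, 2) is missing
print(part_a | part_b == event, part_a & part_b == set())
```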











Often we are interested in joint models for physical experiments that are independent of
each other. This requires a definition. Thus, we define mathematically that two compound
experiments are independent if the probabilities of events $E$ can be expressed in terms of
the individual probability measures $P_1$ and $P_2$.


Definition 1.6-1 Two experiments are said to be independent if (i) for a cross-product
event $E = E_1 \times E_2$, we can write


$$P[E] \triangleq P_1[E_1] P_2[E_2],$$
(ii) the probability of a general event $E$ in the compound experiment can be written, in
terms of singleton events, as


$$P[E] \triangleq \sum_{(\zeta_1, \zeta_2) \in E} P_1[\{\zeta_1\}] P_2[\{\zeta_2\}].$$


We can generalize this concept to combining $n$ experiments to get the compound experiment’s
sample space

$$\Omega \triangleq \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$$

and vector (string) outcomes $\zeta = (\zeta_1, \ldots, \zeta_n) \in E \subset \Omega$, the compound experiment’s sample
space. ■


Example 1.6-5
(three experiments) Consider three independent experiments, each with its own sample
space $\Omega_i$, $i = 1, 2, 3$. Let $E_i$ be any arbitrary event in $\Omega_i$. Then the general cross-product
events $E = E_1 \times E_2 \times E_3$ in the compound experiment would have probabilities


$$P[E_1 \times E_2 \times E_3] = P_1[E_1] P_2[E_2] P_3[E_3],$$


where the events $E_i$ would be made up from unions and intersections of the measurable
subsets of $\Omega_i$.


Example 1.6-6 
(repeated coin flips) Consider flipping a coin $n$ times. Each flip can be considered a random,
independent experiment. Let the individual outcomes in each experiment be denoted H
and T; then the outcomes in the compound experiment are strings of H and T of length $n$.
There are $2^n$ distinguishable ordered strings. The probability of a string having $k$ H and
$n - k$ T is given by


$$\begin{aligned}
P[\{(\zeta_1, \ldots, \zeta_n)\}] &= \prod_{i=1}^{n} P_i[\{\zeta_i\}] \\
&= p^k q^{n-k},
\end{aligned}$$


where $p$ and $q \triangleq 1 - p$, with $0 < p < 1$, are the individual probabilities of H and T,
respectively, on a single coin flip.
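
The product formula $p^k q^{n-k}$ can be checked by listing every string of length $n$. The sketch below is an illustration with assumed values $n = 4$ and $p = 1/3$; the function name `string_probability` is hypothetical. It confirms that there are $2^n$ strings and that their probabilities add to one.

```python
from fractions import Fraction
from itertools import product


def string_probability(s, p):
    """Probability of one ordered string of H and T under independent flips."""
    k = s.count("H")                      # number of heads in the string
    return p**k * (1 - p)**(len(s) - k)   # p^k q^(n-k)


n = 4                   # assumed number of flips
p = Fraction(1, 3)      # assumed probability of H on a single flip

strings = ["".join(t) for t in product("HT", repeat=n)]
total = sum(string_probability(s, p) for s in strings)

print(len(strings), total)
print(string_probability("HHTT", p))
```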


We can also express these compound probabilities in terms of general events rather than
singleton events. Again consider two experiments with probability spaces


$\zeta_1 \in E_1 \subset \Omega_1$ with measure $P_1$ and $\zeta_2 \in E_2 \subset \Omega_2$ with measure $P_2$;

then the compound experiment consists of the probability space


$\zeta = (\zeta_1, \zeta_2) \in E \subset \Omega$ with measure $P$,


where the compound probability measure $P$ is defined for event $E \subset \Omega$ as follows. First, we
must write the compound event $E$ as a disjoint union of cross-product events from the
two experiments


$$E = \bigcup_{i=1}^{k} E_{1i} \times E_{2i},$$
for some positive integer $k$, where $E_{1i}$ and $E_{2i}$ are events in $\Omega_1$ and $\Omega_2$, respectively. In
the simplest case $E$ will itself be a cross-product event, and we will have $k = 1$, but as
we have seen in Example 1.6-4, it will generally be necessary to take the union of several
cross-product events to express an arbitrary event $E$ in the compound sample space.


Definition 1.6-2 (alternative) Then when we say that the experiments are independent,
we mean that for any event $E$ in the compound experiment,


$$P[E] \triangleq \sum_{i=1}^{k} P_1[E_{1i}] P_2[E_{2i}], \quad \text{where } E = \bigcup_{i=1}^{k} E_{1i} \times E_{2i},$$
a disjoint union, and where the $E_{1i}$ and $E_{2i}$ are events in $\Omega_1$ and $\Omega_2$, respectively. Here $k$
is the number of cross-product events necessary to express compound event $E$.

We note that additivity of probability is appropriate since the events are disjoint. We
can see immediately that this alternative definition is consistent with the definition in
terms of elementary or singleton events given above. To see this, simply take $E_{1i}$ and
$E_{2i}$ as singleton events. Clearly this more general approach can also be extended to $n > 2$
experiments straightforwardly. We next turn to the more complicated case of multiple
dependent experiments.


Dependent experiments.* Consider two “dependent experiments,” meaning that the
second experiment’s probabilities will depend on the event that occurs in the first experiment.
Let’s say the first experiment consists of outcomes $\zeta_{1,i}$, where $i = 1, \ldots, k$, whose
probabilities $P_1[\{\zeta_{1,i}\}]$ are given. The probability measures for the second experiment must
be parametrized with index $i$ from the first experiment, that is,

$P_{2,i}[E_2]$ for each event $E_2 \subset \Omega_2$,


where $\Omega_2$ is the sample space for the second experiment. This is illustrated in Figure 1.6-3.
Then we write the probability measure for the compound experiment as follows.


Definition 1.6-3 (dependent experiments) Let $E_1 = \{\zeta_{1,i}\}$ be a singleton event in $\Omega_1$
for some $i$, and let $E_2$ be an event in $\Omega_2$; then consider the cross-product event $E = E_1 \times E_2$
in the compound experiment. We then write


$$P[E] \triangleq P_1[\{\zeta_{1,i}\}] P_{2,i}[E_2],$$


where the probability measure in the second experiment is a function of the outcome in the
first experiment. ■


*Starred material can be omitted on a first reading. 
Figure 1.6-3 Two dependent compound experiments.


First we note that this definition is consistent with the definition above for the case of
independent experiments. This is because in the case of independent experiments all the $P_{2,i}$ are
the same, that is, $P_{2,i} = P_2$, for all $i$.

More generally, let the event in the first experiment be $E_1 = \bigcup_i \{\zeta_{1,i}\}$, that is, a union
of elementary (singleton) events indexed by $i$; then the probability of the compound event $E = E_1 \times E_2$
is written as


$$P[E] \triangleq \sum_{i} P_1[\{\zeta_{1,i}\}] P_{2,i}[E_2].$$


Here additivity makes sense since only one of the elementary events $\{\zeta_{1,i}\}$ can occur in
the first experiment.





Example 1.6-7 
(flip biased coins) Let there be three biased coins considered. We flip the first one, with
$p_1 = P_1[\{H\}]$. Depending on the outcome, H or T, we then flip coin 2 or coin 3, respectively.
Assume for coin 2 that the probability $p_2 = P_2[\{H\}]$, and that for coin 3, we have $p_3 =
P_3[\{H\}]$. Here, of course, we assume that all the $p_i$ satisfy $0 < p_i < 1$. Then for $p_2 \neq p_3$, we
have the case of dependent experiments. Computing, for example, $P[\{HT\}]$, we get $p_1(1 - p_2)$
and for $P[\{TH\}]$, we get $(1 - p_1)p_3$, etc.
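
The four compound probabilities of this example follow the pattern of Definition 1.6-3 and can be tabulated in a few lines. In the sketch below the numerical values of $p_1$, $p_2$, and $p_3$ are arbitrary assumptions chosen only to make the dependence visible, and the function name `compound_probabilities` is hypothetical.

```python
from fractions import Fraction


def compound_probabilities(p1, p2, p3):
    """Probabilities of the strings HH, HT, TH, TT.

    Coin 1 is flipped first; coin 2 follows an H and coin 3 follows a T.
    """
    return {
        "HH": p1 * p2,
        "HT": p1 * (1 - p2),
        "TH": (1 - p1) * p3,
        "TT": (1 - p1) * (1 - p3),
    }


# Assumed values for illustration only.
p1, p2, p3 = Fraction(1, 2), Fraction(3, 4), Fraction(1, 4)
table = compound_probabilities(p1, p2, p3)

second_is_H = table["HH"] + table["TH"]             # P[second flip is H]
second_H_given_first_H = table["HH"] / p1           # equals p2
second_H_given_first_T = table["TH"] / (1 - p1)     # equals p3

print(table)
print(second_is_H, second_H_given_first_H, second_H_given_first_T)
```

Because the two conditional values differ whenever $p_2 \neq p_3$, the second flip's probabilities depend on the first outcome, which is the dependence described in the example.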





Example 1.6-8 
(conditioning on events) Consider that the weather today can be sunny, cloudy, or rainy with
probabilities $p_{1,s}$, $p_{1,c}$, and $p_{1,r}$, respectively, where these three sum to one. Then tomorrow,
it may be also sunny, cloudy, or rainy, and that may depend on what happened today. So the 
conditional probability for the weather tomorrow can depend on these conditioning events, 
and would be expected to be a different measure for each one. We would have a set of three 
conditional probability measures for day 2, one for each condition from day 1. 

Relation to conditional probability. Consider a compound experiment with two component
experiments $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$ that are independent, so that we have the
compound experiment $(\Omega, \mathcal{F}, P)$ with


$$P[E_1 \times E_2] = P_1[E_1] P_2[E_2],$$


for all cross-product events $E_1 \times E_2 \in \mathcal{F}$, where $E_1 \in \mathcal{F}_1$ and $E_2 \in \mathcal{F}_2$. We can think of the
first experiment as occurring before the second one. Let the conditioning event $B \in \mathcal{F}$ be
of the form $B = B_1 \times \Omega_2$, where $B_1 \in \mathcal{F}_1$. Then $P[B] = P_1[B_1] \cdot 1$. Similarly, let the event
$A \in \mathcal{F}$ be of the form $A = \Omega_1 \times A_2$, where $A_2 \in \mathcal{F}_2$; then $P[A] = 1 \cdot P_2[A_2]$, and we find
that the conditional probability

$$\begin{aligned}
P[A|B] &= \frac{P[AB]}{P[B]} \\
&= \frac{P[(\Omega_1 \times A_2) \cap (B_1 \times \Omega_2)]}{P_1[B_1]} \\
&= \frac{P[B_1 \times A_2]}{P_1[B_1]} \\
&= \frac{P_1[B_1] P_2[A_2]}{P_1[B_1]} \\
&= P_2[A_2],
\end{aligned}$$


where we have noted that


$$\begin{aligned}
(\Omega_1 \times A_2) \cap (B_1 \times \Omega_2) &= \{(\zeta_1, \zeta_2) \mid \zeta_2 \in A_2\} \cap \{(\zeta_1, \zeta_2) \mid \zeta_1 \in B_1\} \\
&= B_1 \times A_2.
\end{aligned}$$


Now this is what we expect to happen for two independent experiments. But, what 
happens when the two experiments are dependent? 


*Example 1.6-9 
(dependent case) Consider a compound experiment with two components as above, that is,
$B = B_1 \times \Omega_2$ and $A = \Omega_1 \times A_2$, but now assume that these experiments are dependent.
Assume the number of outcomes in the first experiment to be a finite number k and write
the probability measure of the second experiment as a function of the outcome on the first
experiment, that is, $P_{2,i}$ for each outcome $\zeta_{1,i} \in \Omega_1$ for $i = 1, \ldots, k$. Assume also that
$B_1 = \{\zeta_{1,i}\}$ for some value i. Then proceeding as in the last example, we have

$$P[A|B] = \frac{P[(\Omega_1 \times A_2) \cap (B_1 \times \Omega_2)]}{P_1[B_1]}$$
$$= \frac{P_1[B_1]\,P_{2,i}[A_2]}{P_1[B_1]}$$
$$= P_{2,i}[A_2],$$

as expected.

Example 1.6-10
(communication channel and source) In a binary communication system, we have a binary
source S along with a binary channel C (Figure 1.6-4) defined in terms of its conditional
probabilities. The sample space $\Omega$ for this combined experiment is $\Omega = \{\zeta = (x,y) : x = 0$ or 1
and $y = 0$ or $1\} = \{(0,0), (0,1), (1,0), (1,1)\}$, where x denotes the source output that is
the channel input, and y denotes the channel output. The joint probability function is then
given as $P[\{(x,y)\}] = P_S[\{x\}]\,P_C[\{y\}|\{x\}]$, $x, y = 0, 1$, where $P_S$ is the probability measure
of the source S and $P_C$ is the conditional probability measure of the channel C.

Because of noise a transmitted zero sometimes gets decoded as a received one and vice 
versa. From repeated use of the channel, it is known that 


$$P_C[\{0\}|\{0\}] = 0.9, \quad P_C[\{1\}|\{0\}] = 0.1,$$
$$P_C[\{0\}|\{1\}] = 0.1, \quad P_C[\{1\}|\{1\}] = 0.9,$$

and by design of the source $P_S[\{0\}] = P_S[\{1\}] = 0.5$.† The various probabilities of the
singleton events in the joint experiment are then

$$P[\{(0,0)\}] = P_C[\{0\}|\{0\}]\,P_S[\{0\}] = 0.45$$
$$P[\{(0,1)\}] = P_C[\{1\}|\{0\}]\,P_S[\{0\}] = 0.05$$
$$P[\{(1,0)\}] = P_C[\{0\}|\{1\}]\,P_S[\{1\}] = 0.05$$
$$P[\{(1,1)\}] = P_C[\{1\}|\{1\}]\,P_S[\{1\}] = 0.45.$$


We can also define some events on the compound or combined sample space

$X_0 \triangleq$ “event that $x = 0$” and $X_1 \triangleq$ “event that $x = 1$”

$Y_0 \triangleq$ “event that $y = 0$” and $Y_1 \triangleq$ “event that $y = 1$”

and rewrite the above channel conditional probabilities as

$$P[Y_0|X_0] = 0.9, \quad P[Y_1|X_0] = 0.1,$$
$$P[Y_0|X_1] = 0.1, \quad P[Y_1|X_1] = 0.9.$$

Figure 1.6-4 A binary communication system.

†It is good practice to design a code in which the zeros and ones appear at close to the same rate since
this puts the signaling capacity of the channel to greatest use.

The source probabilities are then just expressed as

$$P[X_0] = 0.5 \quad \text{and} \quad P[X_1] = 0.5.$$

In the combined experiment, the above joint probabilities become

$$P[X_0 \cap Y_0] = P[Y_0|X_0]\,P[X_0] = 0.45$$
$$P[X_0 \cap Y_1] = P[Y_1|X_0]\,P[X_0] = 0.05$$
$$P[X_1 \cap Y_0] = P[Y_0|X_1]\,P[X_1] = 0.05$$
$$P[X_1 \cap Y_1] = P[Y_1|X_1]\,P[X_1] = 0.45.$$








The introduction of conditional probabilities raises the important question of whether
conditional probabilities satisfy Axioms 1 to 3. In other words, given any two events E, F such
that $EF = \phi$ and a third arbitrary event A with $P[A] > 0$, all belonging to the $\sigma$-field of
events $\mathcal{F}$ in the probability space $(\Omega, \mathcal{F}, P)$, does

$$P[E|A] \geq 0?$$
$$P[\Omega|A] = 1?$$
$$P[E \cup F|A] = P[E|A] + P[F|A] \quad \text{for } EF = \phi?$$


The answer is yes. We leave the details as an exercise to the reader. They follow directly 
from the definition of conditional probability and the three Kolmogorov axioms. 


Example 1.6-11 
(probability trees) Three events A, B, and C are often specified in terms of conditional
probabilities as follows:

$$P[A], \quad P[B|A], \quad P[B|A^c], \quad \text{and}$$
$$P[C|BA], \quad P[C|BA^c], \quad P[C|B^cA], \quad P[C|B^cA^c].$$


In such a case the problem can be summarized in a tree diagram, such as Figure 1.6-5, where 
the branches are labeled with the relevant conditional probabilities and the node values are 
the corresponding joint probabilities. Here the root node can be thought of as having value 
1.0 and being associated with the certain event Q. If we want to evaluate the probability 
of an event on a leaf (the last set of nodes) of the tree, we just multiply the conditional 
probabilities on its path. 

A way this can arise is if the events come from compound experiments conducted
sequentially, so that the event B depends on the event A, and in turn the event C depends on
them both. A more general tree would have more than two outgoing branches at each node
indicating more than two events were possible, for example, $A_1, A_2, \ldots, A_n$. The conditional
probabilities can be stored in a data structure in a machine, which could be queried for
answers to various joint probability questions, such as: What is the probability of the joint
event $C_kB_jA_i$? Such a query could be answered by tracing the corresponding path in the stored
data structure and then multiplying the values on its branches. For a concrete example,
let the first-round events $A_i$ indicate the health (good, fair, poor) of a plant purchased
at a local nursery; then $B_j$ can indicate its health one week later, and $C_k$ can indicate the
health at two weeks from purchase.

Figure 1.6-5 A probability tree diagram with conditional probabilities on the branches and joint
probabilities at the nodes.
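One possible way to store such a tree in a machine is sketched below in Python. The nested-dictionary layout, the helper names, and all branch probabilities are assumptions made for illustration; the text does not prescribe a particular data structure. Each node maps an event label to a pair holding the conditional probability on that branch and the subtree below it.

```python
# Each node maps an event label to (conditional probability, subtree).
# "Ac", "Bc", "Cc" stand for the complements; all numbers are hypothetical.
tree = {
    "A": (0.6, {
        "B": (0.7, {"C": (0.9, {}), "Cc": (0.1, {})}),
        "Bc": (0.3, {"C": (0.5, {}), "Cc": (0.5, {})}),
    }),
    "Ac": (0.4, {
        "B": (0.2, {"C": (0.8, {}), "Cc": (0.2, {})}),
        "Bc": (0.8, {"C": (0.3, {}), "Cc": (0.7, {})}),
    }),
}

def joint_probability(tree, path):
    """Multiply the branch probabilities along `path`, starting at the root."""
    prob = 1.0
    node = tree
    for label in path:
        branch_prob, node = node[label]
        prob *= branch_prob
    return prob

def leaves(tree, prefix=()):
    """Yield every root-to-leaf path in the tree."""
    for label, (_, subtree) in tree.items():
        path = prefix + (label,)
        if subtree:
            yield from leaves(subtree, path)
        else:
            yield path

p_cba = joint_probability(tree, ["A", "B", "C"])  # P[CBA] = P[A] P[B|A] P[C|BA]
total = sum(joint_probability(tree, p) for p in leaves(tree))
print(p_cba, total)
```

Summing the joint probabilities over all leaves should return the value 1.0 of the root node, which is a useful consistency check on the stored conditional probabilities.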





The next example, illustrating the use of joint and conditional probabilities, has appli- 
cations in real life where we might be forced to make important decisions without knowing 
all the facts. 


Example 1.6-12
(beauty contest)† Assume that a beauty contest is being judged by the following rules:
(1) There are N contestants not seen by the judges before the contest, and (2) the contestants 
are individually presented to the judges in a random sequence. Only one contestant appears 
before the judges at any one time. (3) The judges must decide on the spot whether the 
contestant appearing before them is the most beautiful. If they decide in the affirmative, 
the contest is over but the risk is that a still more beautiful contestant is in the group as yet 
not displayed. In that case the judges would have made the wrong decision. On the other 
hand, if they pass over the candidate, the contestant is disqualified from further consideration
even if it turns out that all subsequent contestants are less beautiful. What is a good
strategy to follow to increase the probability of picking the most beautiful contestant over
that of a random choice?

†Thanks are due to Geof Williamson and Jerry Tiemann for valuable discussions regarding this problem.

Figure 1.6-6 The numbers along the axis represent the chronology of the draw, not the number
actually drawn from the bag.


Solution To make the problem somewhat more quantitative, assume that all the virtues 
of each contestant are summarized into a single “beauty” number. Thus, the most beautiful 
contestant is associated with the highest number, and the least beautiful has the lowest 
number. We make no assumptions regarding the distribution or chronology of appearance 
of the numbers. The numbers, unseen by the judges, are placed in a bag and the numbers are 
drawn individually from the bag. We model the problem then as one of randomly drawing 
the “beauty” numbers from a bag. We consider that the draws are ordered along a line as 
shown in Figure 1.6-6. Thus, the first draw is number 1, the second is 2, and so forth. At 
each draw, a number appears. Is it the largest of all the N numbers? 

Assume that the following “wait-and-see” strategy is adopted: We pass over the first k 
draws (i.e., we reject the first k contestants) but record the highest number (i.e., the most 
beautiful contestant) observed within this group of k. Then we continue drawing numbers 
(i.e., call for more contestants to appear). The first draw (contestant) after the k passed-over 
draws that yields a number exceeding the largest number from the first k draws is taken to 
be the winner. If a larger number does not occur, then the judge declines to vote and we 
count this as an error. 

Let us define $E_j(k)$ as the event that the largest number that is drawn from the first j
draws occurs in the group of first k draws. Then for $j \leq k$, $E_j(k) = \Omega$ (the certain event),
but for $j > k$, $E_j(k)$ will be a proper subset of $\Omega$. Let x denote the draw that will contain
the largest number among the N numbers in the bag. Then two events must occur jointly
for the correct decision to be realized: (1) (obvious) $\{x > k\}$; and (2) (subtle) $E_j(k)$ for all
j such that $k \leq j < x$. Since the events $E_j(k)$ shrink as j grows, condition (2) is the same as
$E_{x-1}(k)$. Hence a correct decision C happens when, for some j with $k \leq j < N$, the subevent
$\{x = j + 1\}$ occurs jointly with the event $E_j(k)$. The event $\{x > k\}$
can be resolved into disjoint subevents as

$$\{x > k\} = \{x = k+1\} \cup \{x = k+2\} \cup \ldots \cup \{x = N\}.$$

Thus,

$$C = \{x = k+1, E_k(k)\} \cup \{x = k+2, E_{k+1}(k)\} \cup \ldots \cup \{x = N, E_{N-1}(k)\},$$

and the probability of a correct decision is

$$P[C] = \sum_{j=k}^{N-1} P[x = j+1, E_j(k)], \quad \text{because these events are disjoint,}$$
$$= \sum_{j=k}^{N-1} P[E_j(k)\,|\,x = j+1]\,P[x = j+1]$$
$$= \sum_{j=k}^{N-1} \frac{k}{j} \cdot \frac{1}{N},$$

where we have used the fact that $P[x = j+1] = \frac{1}{N}$ since all N draws are equally likely to
result in the largest number. Also $P[E_j(k)\,|\,x = j+1] = \frac{k}{j}$ since the “largest” draw from the
first j draws could equally likely be any of the first j draws, and so the probability that it
is in the first k of these j draws is given by the fraction $\frac{k}{j}$.

By the Euler summation formula† for large N

$$P[C] = \frac{k}{N}\left(\frac{1}{k} + \frac{1}{k+1} + \frac{1}{k+2} + \cdots + \frac{1}{N-1}\right)$$
$$\approx \frac{k}{N}\int_k^N \frac{dx}{x}, \quad \text{for } k \text{ large enough,}$$
$$= \frac{k}{N}\ln\frac{N}{k}.$$

Neglecting the integer constraint, an approximate best choice of k, say $k_0$, can be found by
differentiation. Setting

$$\frac{dP[C]}{dk} = 0,$$

we find that

$$k_0 = \frac{N}{e}.$$

Invoking the integer constraint we round $k_0$ to the nearest integer, so as to finally obtain

$$k_0 \approx \left\lfloor \frac{N}{e} + \frac{1}{2} \right\rfloor,$$

where $\lfloor \cdot \rfloor$ denotes the greatest-integer (floor) function. The maximum probability of a correct
decision P[C] then becomes

$$P[C] \approx \frac{1}{e}\ln e = \frac{1}{e} \approx 0.368.$$

†See, for example, G. F. Carrier et al., Functions of a Complex Variable (New York: McGraw-Hill,
1966), p. 246, or visit the Wikipedia page: Euler-Maclaurin formula
(http://en.wikipedia.org/wiki/Euler%E2%80%93Maclaurin_formula).

Thus, we should let approximately the first third (more precisely 36.8 percent) of the contestants
pass by before beginning to judge the contestants in earnest. We assume that N is
reasonably large for this result to hold. The interesting fact is that the result is independent
of (large) N while the probability of picking the most beautiful candidate by random
selection decreases as 1/N.
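The approximation can be compared with the exact finite sum $P[C] = \frac{k}{N}\sum_{j=k}^{N-1}\frac{1}{j}$ derived above. The following Python sketch does this for one value of N; the function names and the choice N = 100 are assumptions made only for illustration.

```python
import math

def p_correct(k, n):
    """Exact P[C] = (k/n) * sum_{j=k}^{n-1} 1/j for the wait-and-see strategy (1 <= k < n)."""
    return (k / n) * sum(1.0 / j for j in range(k, n))

def best_k(n):
    """Return the k in 1..n-1 that maximizes P[C], together with that probability."""
    k = max(range(1, n), key=lambda kk: p_correct(kk, n))
    return k, p_correct(k, n)

n = 100  # hypothetical number of contestants
k_star, p_star = best_k(n)
k_approx = math.floor(n / math.e + 0.5)  # rounded N/e from the continuous approximation

print("exact best k:", k_star, "P[C]:", round(p_star, 4))
print("approximate k0:", k_approx, "1/e:", round(1 / math.e, 4))
```

Searching over all k is cheap here, so the sketch does not rely on the large-N approximation at all; it only uses it as a point of comparison.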





Here are some other situations that require a strategy that will maximize the probability
of making the right decision.

1. You are apartment-hunting and have selected 30 rent-controlled flats to inspect. You
see an apartment that you like but you are not ready to make an offer because you
think that the next apartment to be shown might be more desirable. However, none
of the subsequent apartments that you visit measures up to the first. Sadly, your offer
for that apartment is rejected because, meanwhile, someone else rented it. You will
have to settle for a far less desirable apartment because you hesitated.

2. You are looking for a partner to spend the rest of your life with. To that end you
contract with a singles dating agency to meet 50 possible life partners at the rate of
one date per week. On your ninth date, you decide that you have found your life’s
partner and offer marriage, which is accepted. However, you forget to tell the dating
agency to stop introducing you to additional partners. The following week you are
introduced to a date that in all qualities surpasses your chosen one. You kick yourself
for having acted too impulsively.

3. You are interviewing candidates for a high-level position in the government. To reduce
the possibility of discrimination on your part you are bound by the following rules:
You are to interview the candidates in sequence and offer the job to the first candidate
who is qualified according to the job description. If you reject a candidate it means
that he/she was not qualified, and you must so state in writing in your report. However,
you are savvy enough to know that even among the qualified candidates there will be
those that are superbly qualified while others will be merely qualified. You want to
hire the best person for the job. What should your strategy be?


Total Probability. In many problems in engineering and science we would like to compute
the unconditional probability P[B] of an event B in terms of the sum of weighted
conditional probabilities. Such a computation is easily realized through the following
theorem.

Theorem 1.6-1 Let $A_1, A_2, \ldots, A_n$ be n mutually exclusive events such that
$\bigcup_{i=1}^{n} A_i = \Omega$ (the $A_i$’s are exhaustive). Let B be any event defined over the probability
space of the $A_i$’s. Then, with $P[A_i] \neq 0$ for all i,

$$P[B] = P[B|A_1]P[A_1] + \ldots + P[B|A_n]P[A_n].$$ (1.6-7)

Sometimes P[B] is called the total probability of B because the expression on the right is a
weighted average of the conditional probabilities of B.

Proof We have $A_iA_j = \phi$ for all $i \neq j$ and $\bigcup_{i=1}^{n} A_i = \Omega$. Also $B\Omega = B = B\bigcup_{i=1}^{n} A_i =
\bigcup_{i=1}^{n} BA_i$. But by definition of the intersection operation, $BA_i \subset A_i$; hence $(BA_i)(BA_j) = \phi$
for all $i \neq j$. Thus, from Axiom 3 (generalized to n events):

$$P[B] = P\left[\bigcup_{i=1}^{n} BA_i\right] = P[BA_1] + P[BA_2] + \ldots + P[BA_n]$$
$$= P[B|A_1]P[A_1] + \ldots + P[B|A_n]P[A_n].$$ (1.6-8)

The last line follows from Equation 1.6-2. ∎


Example 1.6-13 
(more on binary channel) For the binary communication system shown in Figure 1.6-4, 
compute the unconditional output probabilities P[Yo] and P[Yi]. 





Solution Continuing with the notation of binary communication Example 1.6-10, we use 
Equation 1.6-8 as follows: 


$$P[Y_0] = P[Y_0|X_0]P[X_0] + P[Y_0|X_1]P[X_1]$$
$$= P_C[0|0]P_S[0] + P_C[0|1]P_S[1]$$ †
$$= (0.9)(0.5) + (0.1)(0.5)$$
$$= 0.5.$$

We can compute $P[Y_1]$ in a similar fashion or by noting that $Y_0 \cup Y_1 = \Omega$ and $Y_0 \cap Y_1 = \phi$;
that is, they are disjoint. Hence $P[Y_0] + P[Y_1] = 1$, implying $P[Y_1] = 1 - P[Y_0] = 0.5$.
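The total probability computation is a weighted sum, which the following Python sketch makes explicit for the binary channel. The function name and the list-based representation of the partition are assumptions made for illustration.

```python
def total_probability(priors, conditionals):
    """P[B] = sum_i P[B|A_i] P[A_i] for a partition A_1, ..., A_n.

    priors[i] is P[A_i]; conditionals[i] is P[B|A_i].
    """
    if abs(sum(priors) - 1.0) > 1e-12:
        raise ValueError("priors must sum to one (the A_i must be exhaustive)")
    return sum(c * p for c, p in zip(conditionals, priors))

# Binary channel of Example 1.6-10, with A_1 = X0 and A_2 = X1.
p_x = [0.5, 0.5]           # P[X0], P[X1]
p_y0_given_x = [0.9, 0.1]  # P[Y0|X0], P[Y0|X1]
p_y1_given_x = [0.1, 0.9]  # P[Y1|X0], P[Y1|X1]

p_y0 = total_probability(p_x, p_y0_given_x)
p_y1 = total_probability(p_x, p_y1_given_x)
print(p_y0, p_y1)
```

The check on the priors reflects the requirement in Theorem 1.6-1 that the conditioning events be exhaustive.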


1.7 BAYES’ THEOREM AND APPLICATIONS 


The previous results enable us now to write a fairly simple formula known as Bayes’
theorem.‡ Despite its simplicity, this formula is widely used in biometrics, epidemiology,
and communication theory.

†For notational ease, we have abbreviated these terms by leaving off the curly brackets. We retain the
square brackets for probabilities P throughout to remind us that they are set functions.
‡Named after Thomas Bayes, English mathematician/philosopher (1702-1761).

Bayes’ Theorem Let $A_i$, $i = 1, \ldots, n$, be a set of disjoint and exhaustive events
defined on a probability space $\mathcal{P}$. Then, $\bigcup_{i=1}^{n} A_i = \Omega$, $A_iA_j = \phi$ for all $i \neq j$. With B any
event defined on $\mathcal{P}$ with $P[B] > 0$ and $P[A_i] \neq 0$ for all i,

$$P[A_j|B] = \frac{P[B|A_j]\,P[A_j]}{\sum_{i=1}^{n} P[B|A_i]\,P[A_i]}.$$ (1.7-1)

Proof The denominator is simply P[B] by Equation 1.6-8 and the numerator is simply
$P[A_jB]$. Thus, Bayes’ theorem is merely an application of the definition of conditional
probability.


Remark In practice the terms in Equation 1.7-1 are given various names: $P[A_j|B]$
is known as the a posteriori probability of $A_j$ given B; $P[B|A_j]$ is called the a priori
probability of B given $A_j$; and $P[A_j]$ is the causal or a priori probability of $A_j$. In general
a priori probabilities are estimated from past measurements or presupposed by experience 
while a posteriori probabilities are measured or computed from observations. 


Example 1.7-1 
(inverse binary channel) In a communication system a zero or one is transmitted with
$P_S[0] = p_0$, $P_S[1] = 1 - p_0 \triangleq p_1$, respectively. Due to noise in the channel, a zero can be
received as a one with probability $\beta$, called the cross-over probability, and a one can be
received as a zero also with probability $\beta$. A one is observed at the output of the channel.
What is the probability that a one was output by the source and input to the channel, that
is, transmitted?

Solution The structure of the channel is shown in Figure 1.7-1. We write

$$P[X_1|Y_1] = \frac{P[X_1Y_1]}{P[Y_1]}$$ (1.7-2)
$$= \frac{P_C[1|1]\,P_S[1]}{P_C[1|1]\,P_S[1] + P_C[1|0]\,P_S[0]}$$ (1.7-3)
$$= \frac{p_1(1-\beta)}{p_1(1-\beta) + p_0\beta}.$$ (1.7-4)

Figure 1.7-1 Representation of a binary communication channel subject to noise.

If $p_0 = p_1 = \frac{1}{2}$, the inverse or a posteriori probability $P[X_1|Y_1]$ depends on $\beta$ as shown in
Figure 1.7-2. The channel is said to be noiseless if $\beta = 0$, but notice that the channel is just
as useful when $\beta = 1$. Just invert the outputs in this case!
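To see how the a posteriori probability in Equation 1.7-4 varies with the cross-over probability, one can evaluate it directly. The Python sketch below does so; the function name and the sample values of $\beta$ are assumptions chosen for illustration.

```python
def posterior_one_given_one(p1, beta):
    """P[X1|Y1] = p1(1-beta) / (p1(1-beta) + p0*beta), with p0 = 1 - p1 (Equation 1.7-4)."""
    p0 = 1.0 - p1
    denom = p1 * (1.0 - beta) + p0 * beta  # P[Y1], by total probability
    if denom == 0.0:
        raise ValueError("P[Y1] = 0: a received one is impossible for these parameters")
    return p1 * (1.0 - beta) / denom

# Equally likely source symbols: the posterior reduces to 1 - beta.
for beta in (0.0, 0.1, 0.5, 1.0):
    print(beta, posterior_one_given_one(0.5, beta))
```

For $\beta = 0.5$ the posterior equals the prior $p_1$ whatever its value, which reflects that such a channel output carries no information about the input.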


Example 1.7-2 
(amyloid test for Alzheimer’s disease) On August 10, 2010 there was a story on network 
television news that a promising new test was developed for Alzheimer’s disease. It was 
based on the occurrence of the protein amyloid in the spinal (and cerebral) fluid, which 
could be detected via a spinal tap. It was reported that among Alzheimer’s patients (65 
and older) there were 90 percent who had amyloid protein, while among the Alzheimer’s 
free group (65 and older) amyloid was present in only 36 percent of this subpopulation. 
Now the general incidence of Alzheimer’s among the group 65 and older is thought to be 
10 percent from various surveys over the years. From this data, we want to find out: Is it 
really a good test? 

First we construct the probability space for this experiment. We set Q = {00, 01, 10, 11} 
with four outcomes: 





00 = “no amyloid” and “no Alzheimer’s,”
01 = “no amyloid” and “Alzheimer’s,”
10 = “amyloid” and “no Alzheimer’s,”
11 = “amyloid” and “Alzheimer’s.”

On this sample space, we define two events: $A \triangleq \{10, 11\}$ = “amyloid” and
$B \triangleq \{01, 11\}$ = “Alzheimer’s.” From the data above we have

$$P[A|B] = 0.9 \quad \text{and} \quad P[A|B^c] = 0.36.$$

Figure 1.7-2 A posteriori probability versus $\beta$.

Also from the general population (65 and greater), we know

$$P[B] = 0.1 \quad \text{and} \quad P[B^c] = 1 - P[B] = 0.9.$$


Now to determine if the test is good, we must look at the situation after we give the test,
and this is modeled by the conditional probabilities after the test. They are either $P[\cdot|A]$,
if the test is positive for amyloid, or $P[\cdot|A^c]$, if the test is negative. So, we can use Bayes’
theorem, with total probability in the denominator, to find answers such as $P[B|A]$. We have

$$P[B|A] = \frac{P[A|B]\,P[B]}{P[A|B]\,P[B] + P[A|B^c]\,P[B^c]}$$
$$= \frac{0.9 \times 0.1}{0.9 \times 0.1 + 0.36 \times 0.9}$$
$$\approx 0.217.$$








So, among the group that tests positive, only about 22 percent will actually have Alzheimer’s. 
The test does not seem so promising now. Why is this? Well, the problem is that we are never 
in the “knowledge state” characterized by event B where conditional probability $P[\cdot|B]$ is
relevant. Before the test is given, our knowledge state is characterized by the unconditional
probabilistic knowledge $P[\cdot]$. After the test, we have knowledge state determined by
whether event A or $A^c$ has occurred; that is, our conditional probabilistic state is either
$P[\cdot|A]$ or $P[\cdot|A^c]$. You see, we enter into states of knowledge either “given A” or “given $A^c$”
by testing the population. So we are never in the situation or knowledge state where $P[\cdot|B]$
or $P[\cdot|B^c]$ is the relevant probability measure. So the given information $P[A|B] = 0.9$ and
$P[A|B^c] = 0.36$ is not helpful to directly decide whether the test is useful or not. This is
the logical fallacy of reasoning with $P[A|B]$ instead of $P[B|A]$, but there is another very
practical thing going on here too in this particular example.

When we calculate $P[B^c|A] = 1.0 - 0.217 = 0.783$, this means that about 78 percent of
those with positive amyloid tests do not have Alzheimer’s. So the test is not useful due to 
its high false-positive rate. Again, as in the previous example, the scarcity of Alzheimer’s 
in the general population (65 and greater) is a problem here, and any test will have to 
overcome this in order to become a useful test. 
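The same numbers can be obtained by first building the joint probabilities of the four outcomes in $\Omega$ = {00, 01, 10, 11} and then conditioning on the test result. The Python sketch below uses exact fractions; the variable names are assumptions, and the value of $P[B|A^c]$ is an additional quantity computed here from the given data rather than one quoted in the example.

```python
from fractions import Fraction

p_B = Fraction(10, 100)            # P[B]: Alzheimer's incidence, 65 and older
p_A_given_B = Fraction(90, 100)    # P[A|B]: amyloid present given Alzheimer's
p_A_given_Bc = Fraction(36, 100)   # P[A|B^c]: amyloid present given no Alzheimer's

# Joint probabilities of the four outcomes; first digit = amyloid, second = Alzheimer's.
joint = {
    "11": p_A_given_B * p_B,
    "01": (1 - p_A_given_B) * p_B,
    "10": p_A_given_Bc * (1 - p_B),
    "00": (1 - p_A_given_Bc) * (1 - p_B),
}

p_A = joint["10"] + joint["11"]         # P[A], by total probability
p_B_given_A = joint["11"] / p_A         # P[B|A]: Alzheimer's given a positive test
p_B_given_Ac = joint["01"] / (1 - p_A)  # P[B|A^c]: Alzheimer's given a negative test

print(float(p_A), float(p_B_given_A), float(p_B_given_Ac))
```

Working from the joint table makes it easy to read off either conditional measure, $P[\cdot|A]$ or $P[\cdot|A^c]$, which are the two knowledge states reached after testing.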








1.8 COMBINATORICS†


Before proceeding with our study of basic probability, we introduce a number of counting 
formulas important for counting equiprobable events. Some of the results presented here 
will have immediate application in Section 1.9; others will be useful later. 


†This material closely follows that of William Feller [1-8].

A population of size n will be taken to mean a collection (set) of n elements without
regard to order. Two populations are considered different if one contains at least one element 
not contained in the other. A subpopulation of size r from a population of size n is a subset of 
r elements taken from the original population. Likewise, two subpopulations are considered 
different if one has at least one element different from the other. 

Consider a population of n elements $a_1, a_2, \ldots, a_n$. Any ordered arrangement $a_{k_1}, a_{k_2},
\ldots, a_{k_r}$ of r symbols is called an ordered sample of size r. Consider now the generic urn
containing n distinguishable numbered balls. Balls are removed one by one. How many 
different ordered samples of size r can be formed? There are two cases: 

(i) Sampling with replacement. Here after each ball is removed, its number is recorded 
and it is returned to the urn. Thus, for the first sample there are n choices, for the second 
there are again n choices, and so on. Thus, we are led to the following result: For a population
of n elements, there are $n^r$ different ordered samples of size r that can be formed with
replacement.

(ii) Sampling without replacement. After each ball is removed, it is not available 
anymore for subsequent samples. Thus, n balls are available for the first sample, n — 1 
for the second, and so forth. Thus, we are now led to the result: For a population of n
elements, there are

$$(n)_r \triangleq n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!}$$ (1.8-1)

different ordered samples of size r that can be formed without replacement.†

The Number of Subpopulations of Size r in a Population of Size n. A basic problem 
that often occurs in probability is the following: How many groups, that is, subpopulations, 
of size r can be formed from a population of size n? For example, consider six balls numbered 
1 to 6. How many groups of size 2 can be formed? The following table shows that there are 
15 groups of size 2 that can be formed: 


12 23 34 45 56
13 24 35 46
14 25 36
15 26
16


Note that this is different from the number of ordered samples that can be formed
without replacement. These are ($6 \cdot 5 = 30$):

12 21 31 41 51 61
13 23 32 42 52 62
14 24 34 43 53 63
15 25 35 45 54 64
16 26 36 46 56 65

†Different samples will often contain the same subpopulation but with a different ordering. For this
reason we sometimes speak of $(n)_r$ ordered samples that can be formed without replacement.

Also it is different from the number of samples that can be formed with replacement
($6^2 = 36$):

11 21 31 41 51 61
12 22 32 42 52 62
13 23 33 43 53 63
14 24 34 44 54 64
15 25 35 45 55 65
16 26 36 46 56 66
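The three counts for six balls taken two at a time can be reproduced by direct enumeration. The Python sketch below uses the standard `itertools` module; the variable names are assumptions made for illustration.

```python
from itertools import combinations, permutations, product

balls = range(1, 7)  # six balls numbered 1 to 6
r = 2

with_replacement = list(product(balls, repeat=r))   # ordered samples, with replacement
without_replacement = list(permutations(balls, r))  # ordered samples, without replacement
subpopulations = list(combinations(balls, r))       # unordered groups of size r

print(len(with_replacement), len(without_replacement), len(subpopulations))
```

Each ordered sample without replacement corresponds to one of the r! orderings of a subpopulation, which is why the last two counts differ by a factor of 2! here.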


A general formula for the number of subpopulations, $C_r^n$, of size r in a population of
size n can be computed as follows: Consider an urn with n distinguishable balls. We already
know that the number of ordered samples of size r that can be formed is $(n)_r$. Now consider
a specific subpopulation of size r. For this subpopulation there are r! arrangements and
therefore r! different ordered samples. Thus, for $C_r^n$ subpopulations there must be $C_r^n \cdot r!$
different ordered samples of size r. Hence

$$C_r^n \cdot r! = (n)_r$$

or

$$C_r^n = \frac{(n)_r}{r!} = \frac{n!}{(n-r)!\,r!} \triangleq \binom{n}{r}.$$ (1.8-2)

Equation 1.8-2 is an important result, and we shall apply it in the next section. The symbol

$$\binom{n}{r} \triangleq C_r^n$$

is called a binomial coefficient. Clearly

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!} = \frac{n!}{(n-r)!\,r!} = \binom{n}{n-r}.$$ (1.8-3)

We already know from Section 1.4 that the total number of subsets of a set of size n is $2^n$.
The number of subsets of size r is $\binom{n}{r}$. Hence we obtain that

$$\sum_{r=0}^{n} \binom{n}{r} = 2^n.$$


A result which can be viewed as an extension of the binomial coefficient $C_r^n$ is given by the
following.

Theorem 1.8-1 Let $r_1, \ldots, r_l$ be a set of l nonnegative integers such that $r_1 + r_2 + \ldots
+ r_l = n$. Then the number of ways in which a population of n elements can be partitioned
into l subpopulations of which the first contains $r_1$ elements, the second $r_2$ elements, and
so forth, is

$$\frac{n!}{r_1!\,r_2! \cdots r_l!}.$$ (1.8-4)
This coefficient is called the multinomial coefficient. Note that the order of the subpopulations
is essential in the sense that ($r_1 = 7$, $r_2 = 10$) and ($r_1 = 10$, $r_2 = 7$) represent different
partitions. However, the order within each group does not receive attention. For example,
suppose we have five distinguishable balls (1,2,3,4,5) and we ask how many partitions
can be made with three balls in the first group and two in the second. Here $n = 5$, $r_1 = 3$,
$r_2 = 2$, and $r_1 + r_2 = 5$. The answer is $5!/(3!\,2!) = 10$ and the partitions are

Group 1: | 1,2,3 | 2,3,4 | 3,4,5 | 4,5,1 | 5,1,2 | 2,4,5 | 2,3,5 | 1,3,5 | 1,3,4 | 1,2,4
Group 2: | 4,5   | 5,1   | 1,2   | 2,3   | 3,4   | 1,3   | 1,4   | 2,4   | 2,5   | 3,5

Note that the order is important in that had we set $r_1 = 2$ and $r_2 = 3$ we would have gotten
a different partition, for example,

Group 1: | 4,5   | 5,1   | 1,2   | 2,3   | 3,4   | 1,3   | 1,4   | 2,4   | 2,5   | 3,5
Group 2: | 1,2,3 | 2,3,4 | 3,4,5 | 4,5,1 | 5,1,2 | 2,4,5 | 2,3,5 | 1,3,5 | 1,3,4 | 1,2,4

The partition (4,5), (1,2,3) is, however, identical with (5,4), (2,1,3).
Proof Note that we can rewrite Equation 1.8-4 as

$$\frac{n!}{r_1!\,(n-r_1)!} \cdot \frac{(n-r_1)!}{r_2!\,(n-r_1-r_2)!} \cdots \frac{(n-r_1-\ldots-r_{l-1})!}{r_l!\,(n-r_1-\ldots-r_l)!}.$$

Recalling that 0! = 1, we see that the last term is unity. Then the multinomial formula is
written as

$$\frac{n!}{r_1!\,r_2! \cdots r_l!} = \binom{n}{r_1}\binom{n-r_1}{r_2}\binom{n-r_1-r_2}{r_3} \cdots \binom{n-r_1-r_2-\ldots-r_{l-1}}{r_l}.$$ (1.8-5)

To effect a realization of $r_1$ elements in the first subpopulation, $r_2$ in the second, and
so on, we would select $r_1$ elements from the given n, $r_2$ from the remaining $n - r_1$, $r_3$
from the remaining $n - r_1 - r_2$, etc. But there are $\binom{n}{r_1}$ ways of choosing $r_1$ elements
out of n, $\binom{n-r_1}{r_2}$ ways of choosing $r_2$ elements out of the remaining $n - r_1$, and so
forth. Thus, the total number of ways of choosing $r_1$ from n, $r_2$ from $n - r_1$, and so on is
simply the product of the factors on the right-hand side of Equation 1.8-5 and the proof is
complete. ∎


Example 1.8-1 
(toss 12 dice) [1-8, p. 36] Suppose we throw 12 dice; since each die throw has six outcomes
there are a total of $n_T = 6^{12}$ outcomes. Consider now the event E that each face appears
twice. There are, of course, many ways in which this can happen. Two outcomes in which 
this happens are shown below: 


Dice I.D. Number 
Outcome 1 
Outcome 2 










The total number of ways that this event can occur is the number of ways 12 dice (n = 12)
can be arranged into six groups (k = 6) of two each ($r_1 = r_2 = \ldots = r_6 = 2$). Assuming
that all outcomes are equally likely we compute

$$P[E] = \frac{n_E}{n_T} = \frac{\text{number of ways } E \text{ can occur}}{\text{total number of outcomes}}$$
$$= \frac{12!}{(2!)^6\,6^{12}} = 0.003438.$$
The binomial and multinomial coefficients appear in the binomial and multinomial 
probability laws discussed in the next sections. The multinomial coefficient is also important 
in a class of problems called occupancy problems that occur in theoretical physics. 
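The multinomial coefficient of Equation 1.8-4 is simple to evaluate with integer arithmetic. The Python sketch below checks it against a direct enumeration for the five-ball case and then evaluates the dice probability of Example 1.8-1; the function and variable names are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def multinomial(*counts):
    """n! / (r_1! r_2! ... r_l!) with n = r_1 + ... + r_l (Equation 1.8-4)."""
    result = factorial(sum(counts))
    for r in counts:
        result //= factorial(r)  # each intermediate quotient is an integer
    return result

# Enumerate the partitions of {1,...,5} into a first group of 3 and a second group of 2.
population = {1, 2, 3, 4, 5}
partitions = [(set(g1), population - set(g1))
              for g1 in combinations(sorted(population), 3)]

# Probability that each face appears exactly twice when 12 dice are thrown.
p_each_face_twice = multinomial(2, 2, 2, 2, 2, 2) / 6**12

print(multinomial(3, 2), len(partitions), p_each_face_twice)
```

Choosing the first group fixes the second, so enumerating the three-element subsets is enough to list every partition in the five-ball case.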





Occupancy Problems* 


Occupancy problems are generically modeled as the random placement of r balls into n
cells. For the first ball there are n choices, for the second ball there are n choices, and so
on, so that there are $n^r$ possible distributions of r balls in n cells and each has a probability
of $n^{-r}$. If the balls are distinguishable, then each of the distributions is distinguishable; if
the balls are not distinguishable, then there are fewer than $n^r$ distinguishable distributions.
For example, with three distinguishable balls (r = 3) labeled “1,” “2,” “3” and two cells
(n = 2), we get eight ($2^3$) distinguishable distributions:

Cell no. 1 | 1   | 2   | 3   | 1,2 | 1,3 | 2,3 | 1,2,3 | -
Cell no. 2 | 2,3 | 1,3 | 1,2 | 3   | 2   | 1   | -     | 1,2,3
When the balls are not distinguishable (each ball is represented by a “*”), we obtain
four distinct distributions:

Cell no. 1 | *** | ** | *  | -
Cell no. 2 | -   | *  | ** | ***

How many distinguishable distributions can be formed from r balls and n cells? An
elegant way to compute this is furnished by William Feller [1-8, p. 38] using a clever artifice.
This artifice consists of representing the n cells by the spaces between n + 1 bars and the
balls by stars. Thus,

| | | |

represents three empty cells, while

|**| | |*|**| |*****|

represents two balls in the first cell, zero balls in the second and third cells, one in the
fourth, two in the fifth, and so on. Indeed, with $r_i \geq 0$ representing the number of balls in
the ith cell and r being the total number of balls, it follows that

$$r_1 + r_2 + \ldots + r_n = r.$$

The n-tuple $(r_1, r_2, \ldots, r_n)$ is called the occupancy and the $r_i$ are the occupancy numbers;
two distributions of balls in cells are said to be indistinguishable if their corresponding
occupancies are identical. The occupancy of

|**| | |*|**| |*****|

is (2,0,0,1,2,0,5). Note that n cells require n + 1 bars but since the first and last symbols
must be bars, only n − 1 bars and r stars can appear in any order. Thus, we are asking for
the number of subpopulations of size r in a population of size n − 1 + r. The result is, by
Equation 1.8-2,

$$\binom{n+r-1}{r} = \binom{n+r-1}{n-1}.$$ (1.8-6)
Example 1.8-2
(distinguishable distributions) Show that the number of distinguishable distributions in
which no cell remains empty is $\binom{r-1}{n-1}$. Here we require that no bars be adjacent. Therefore,
n of the r stars must occupy spaces between the bars but the remaining r − n stars
can go anywhere. Thus, n − 1 bars and r − n stars can appear in any order. The number of
distinct distributions is then equal to the number of ways of choosing r − n places in (n − 1)
bars + (r − n) stars or r − n out of n − 1 + r − n = r − 1. This is, by Equation 1.8-2,

$$\binom{r-1}{r-n} = \binom{r-1}{n-1}.$$
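Both counting formulas can be checked by brute force for small n and r. The Python sketch below lists every occupancy vector directly; the function name and the particular values n = 3, r = 5 are assumptions chosen for illustration.

```python
from itertools import product
from math import comb

def occupancies(n, r):
    """All occupancy vectors (r_1, ..., r_n) of nonnegative integers summing to r (brute force)."""
    return [occ for occ in product(range(r + 1), repeat=n) if sum(occ) == r]

n, r = 3, 5  # hypothetical small case: five indistinguishable balls in three cells
all_occ = occupancies(n, r)
no_empty = [occ for occ in all_occ if min(occ) > 0]

print(len(all_occ), comb(n + r - 1, r))   # count of all occupancies vs. Equation 1.8-6
print(len(no_empty), comb(r - 1, n - 1))  # count with no empty cell vs. Example 1.8-2
```

The enumeration grows as $(r+1)^n$, so it is only practical as a check on small cases; the closed forms are what one would use in general.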


Example 1.8-3
(birthdays on same date) Small groups of people are amazed to find that their birthdays 
often coincide with others in the group. Before declaring this a mystery of fate, we analyze 
this situation as an occupancy problem. We want to compute how large a group is needed 
to have a high probability of a birthday collision, that is, at least two people in the group 
having their birthdays on the same date.

Solution We let the $n$ ($n = 365$) days of the year be represented by $n$ cells, and the $r$
people in the group be represented by $r$ balls. Then when a ball is placed into a cell, it fixes
the birthday of the person represented by that ball. A birthday collision occurs when two
or more balls are in the same cell. Now consider the arrangements of the balls. The first
ball can go into any of the $n$ cells, but the second ball has only $n-1$ cells to choose from
to avoid a collision. Likewise, the third ball has only $n-2$ cells to choose from if a collision
is to be avoided. Continuing in this fashion, we find that the number of arrangements that
avoid a collision is $n(n-1)\cdots(n-r+1)$. The total number of arrangements of $r$ balls in
$n$ cells is $n^r$. Hence with $P_0(r,n)$ denoting the probability of zero birthday collisions as a
function of $r$ and $n$, we find that

$$P_0(r,n) = \frac{n(n-1)\cdots(n-r+1)}{n^r} = \prod_{i=1}^{r-1}\left(1-\frac{i}{n}\right).$$

Then $1 - P_0(r,n)$ is the probability of at least one collision.

How large does $r$ need to be so that $1 - P_0(r,n) \geq 0.9$ or, equivalently, $P_0(r,n) \leq 0.1$?
Except for the mechanics of solving for $r$ in $\prod_{i=1}^{r-1}\left(1-\frac{i}{n}\right) \leq 0.1$, the problem is over. We use
a result from elementary calculus that for any real $x$, $1 - x \leq e^{-x}$, which is quite a good
approximation for $x$ near 0. If we replace $\prod_{i=1}^{r-1}\left(1-\frac{i}{n}\right) \leq 0.1$ by $\prod_{i=1}^{r-1} e^{-i/n} \leq 0.1$, we get a
bound and estimate of $r$. Since $\prod_{i=1}^{r-1} e^{-i/n} = \exp\left\{-\frac{1}{n}\sum_{i=1}^{r-1} i\right\}$ and with use of $\sum_{i=1}^{r-1} i = r(r-1)/2$,
it follows that $e^{-r(r-1)/(2n)} \leq 0.1$ will give us an estimate of $r$. Solving for $r$ and assuming
that $r^2 \gg r$, and $n = 365$, we get that $r \approx 40$. So having 40 people in a group will yield
roughly a 90 percent probability of (at least) two people having their birthdays on the same day.
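
As a numerical check on the estimate above, the following Python sketch evaluates $P_0(r,n)$ exactly and compares the smallest group size that meets the 90 percent target with the value obtained from the exponential bound. The function names are illustrative and do not come from the text.

```python
import math

def prob_no_collision(r, n=365):
    """P_0(r, n): probability that r balls land in r distinct cells out of n."""
    p = 1.0
    for i in range(1, r):
        p *= 1.0 - i / n
    return p

def smallest_group(target=0.9, n=365):
    """Smallest r with collision probability 1 - P_0(r, n) >= target."""
    r = 1
    while 1.0 - prob_no_collision(r, n) < target:
        r += 1
    return r

def bound_estimate(target=0.9, n=365):
    """Smallest r with exp(-r(r-1)/(2n)) <= 1 - target (from 1 - x <= e^{-x})."""
    c = -2.0 * n * math.log(1.0 - target)   # we need r(r-1) >= c
    return math.ceil((1.0 + math.sqrt(1.0 + 4.0 * c)) / 2.0)

exact_r = smallest_group()
approx_r = bound_estimate()
print(exact_r, approx_r)
```

Because $1 - x \leq e^{-x}$, the group size obtained from the bound can never be smaller than the exact one; both land close to the $r \approx 40$ quoted in the solution.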


Example 1.8-4 
(treize) In seventeenth-century Venice, during the holiday of Carnivale, gamblers wearing 
the masks of the characters in the commedia dell’arte played the card game treize in enter- 
tainment houses called ridottos. 

In treize, one player acts as the bank and the other players place their bets. The bank 
shuffles the deck, cards face down, and then calls out the names of the cards in order, from 
1 to 13—ace to king—as he turns over one card at a time. If the card that is turned over 
matches the number he calls, then he (the bank) wins and collects all the bets. If the card 
that is turned over does not match the bank’s call, the game continues until the dealer calls 
“thirteen.” If the thirteenth card turned over is not a king, the bank loses the game and 
must pay each of the bettors an amount equal to their bet; in that case the player acting 
as bank must relinquish his position as bank to the player on the right. 

What is the probability that the bank wins? 





Solution We simplify the analysis by assuming that once a card is turned over, and there 
is no match, it is put back into the deck and the deck is reshuffled before the next card is 
dealt, that is, turned over. Let $A_n$ denote the event that the bank has a first match, that
is, a win, on the $n$th deal and $W_n$ denote the event of a win in $n$ tries. Since there are 4
cards of each number in a deck of 52 cards, the probability of a match is 1/13. In order for 
a first win on the $n$th deal there have to be $n-1$ nonmatches followed by a match. The
probability of this event is

$$P[A_n] = \left(\frac{12}{13}\right)^{n-1}\frac{1}{13}.$$

Since $A_iA_j = \phi$ for $i \neq j$, the probability of a win in 13 tries is

$$P[W_{13}] = \sum_{n=1}^{13} P[A_n] = \frac{1}{13}\cdot\frac{1-\left(\frac{12}{13}\right)^{13}}{1-\frac{12}{13}} = 0.647,$$

from which it follows that the probability of the event $W_{13}^c$ that the bank loses is $P[W_{13}^c] = 0.353$.
Actually this result could have been more easily obtained by observing that the bank
loses if it fails to get a match (no successes) in 13 tries, with success probability 1/13. Hence

$$P[W_{13}^c] = \binom{13}{0}\left(\frac{1}{13}\right)^{0}\left(\frac{12}{13}\right)^{13} = 0.353.$$

Note that in the second equation we used the sum of the geometric series result:
$\sum_{n=0}^{N-1} x^n = \dfrac{1-x^N}{1-x}$ (cf. Appendix A).

Points to consider. Why does $P[A_n] \to 0$ as $n \to \infty$? Why is $P[W_n] \geq P[A_n]$ for all
n? How would you remodel this problem if we didn’t make the assumption that the dealt 
card was put back into the deck? How would the problem change if the bank called down 
from 13 (king) to 1 (ace)? 
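
The two routes to the bank's winning probability can be compared numerically. The sketch below, written under the same with-replacement assumption as the solution, sums the probabilities of the disjoint first-match events and checks the total against the complement of "no match in 13 calls". Exact rational arithmetic is used so that the two results can be compared for equality; the variable names are illustrative.

```python
from fractions import Fraction

p = Fraction(1, 13)      # probability of a match on any single call
q = 1 - p                # probability of no match on a single call

# P[A_n] = q^(n-1) p: first match occurs on the n-th call
first_match = [q ** (n - 1) * p for n in range(1, 14)]

win_by_sum = sum(first_match)       # P[W_13] as a sum over disjoint events
win_by_complement = 1 - q ** 13     # 1 - P[no match in 13 calls]

print(float(win_by_sum), float(win_by_complement))
```

The same list `first_match` also shows the terms shrinking with $n$, which bears on the first of the points to consider.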








In statistical mechanics, a six-dimensional space called phase space is defined as a
space which consists of three position and three momentum coordinates. Because of the
uncertainty principle, which states that the uncertainty in position times the uncertainty
in momentum cannot be less than Planck's constant $h$, phase space is quantized into tiny
cells of volume $v = h^3$. In a system that contains atomic or molecular size particles,
the distribution of these particles among the cells constitutes the state of the system. In
Maxwell-Boltzmann statistics, all distributions of $r$ particles among $n$ cells are equally likely.
It can be shown (see, for example, Concepts of Modern Physics by A. Beiser, McGraw-Hill,
1973) that this leads to the famous Boltzmann law

$$n(\varepsilon) = \frac{2\pi N}{(\pi kT)^{3/2}}\sqrt{\varepsilon}\,e^{-\varepsilon/kT}, \tag{1.8-7}$$

where $n(\varepsilon)\,d\varepsilon$ is the number of particles with energy between $\varepsilon$ and $\varepsilon + d\varepsilon$, $N$ is the total
number of particles, $T$ is absolute temperature, and $k$ is the Boltzmann constant. The
Maxwell-Boltzmann law holds for identical particles that, in some sense, can be distinguished.
It is argued that the molecules of a gas are particles of this kind. It is not difficult
to show that Equation 1.8-7 integrates to $N$.

In contrast to the Maxwell-Boltzmann statistics, where all $n^r$ arrangements are equally
likely, Bose-Einstein statistics considers only distinguishable arrangements of indistinguishable
identical particles. For $n$ cells and $r$ particles, the number of such arrangements is given
by Equation 1.8-6

$$\binom{n+r-1}{r},$$

and each arrangement is assigned a probability

$$\binom{n+r-1}{r}^{-1}.$$

It is argued that Bose-Einstein statistics are valid for photons, nuclei, and particles of zero 
or integral spin that do not obey the exclusion principle. The exclusion principle, discovered 
by Wolfgang Pauli in 1925, states that for a certain class of particles (e.g., electrons) no two 
particles can exist in the same quantum states (e.g., no two or more balls in the same cell). 

To deal with particles that obey the exclusion principle, a third assignment of probabilities
is construed. This assignment, called Fermi-Dirac statistics, assumes

(1) the exclusion principle (no two or more balls in the same cell); and
(2) all distinguishable arrangements satisfying (1) are equally probable.

Note that for Fermi-Dirac statistics, $r \leq n$. The number of distinguishable arrangements
under the hypothesis of the exclusion principle is the number of subpopulations of size $r \leq n$
in a population of $n$ elements or $\binom{n}{r}$. Since each is equally likely, the probability of any
one state is $\binom{n}{r}^{-1}$.

The above discussions should convince the reader of the tremendous importance of 


probability in the basic sciences as well as its limitations: No amount of pure reasoning based 
on probability axioms could have determined which particles obey which probability laws. 
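
The three counting rules discussed above can be placed side by side. The following sketch simply evaluates the number of equally likely arrangements under each assumption for a small, hypothetical case of 5 cells and 3 particles; the function names are illustrative.

```python
from math import comb

def maxwell_boltzmann(n, r):
    """Distinguishable particles, any occupancy: n**r arrangements."""
    return n ** r

def bose_einstein(n, r):
    """Indistinguishable particles, any occupancy: C(n+r-1, r) arrangements."""
    return comb(n + r - 1, r)

def fermi_dirac(n, r):
    """Indistinguishable particles, at most one per cell: C(n, r) arrangements."""
    return comb(n, r)

n, r = 5, 3  # hypothetical: 5 cells, 3 particles
for name, count in [("Maxwell-Boltzmann", maxwell_boltzmann(n, r)),
                    ("Bose-Einstein", bose_einstein(n, r)),
                    ("Fermi-Dirac", fermi_dirac(n, r))]:
    print(f"{name}: {count} arrangements, probability {1 / count:.4f} each")
```

The counts differ only because the three models disagree on which arrangements are distinct and admissible, not because of any difference in the underlying arithmetic.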


Extensions and Applications 


Theorem 1.5-1 on the probability of a union of events can be used to solve problems of
engineering interest. First we note that the number of individual probability terms in the
sum $S_i$ is $\binom{n}{i}$. Why? There are a total of $n$ indices and in $S_i$, all terms have $i$ indices.
For example, with $n = 5$ and $i = 2$, $S_2$ will consist of the sum of the terms $P_{ij}$, where the
indices $ij$ are 12; 13; 14; 15; 23; 24; 25; 34; 35; 45. Each set of indices in $S_i$ never repeats,
that is, they are all different. Thus, the number of indices and, therefore, the number of
terms in $S_i$ is the number of subpopulations of size $i$ in a population of size $n$ which is $\binom{n}{i}$
from Equation 1.8-2. Note that $S_n$ will have only a single term.


Example 1.8-5
We are given r balls and n cells. The balls are indistinguishable and are to be randomly 
distributed among the n cells. Assuming that each arrangement is equally likely, compute 
the probability that all cells are occupied. Note that the balls may represent data packets 
and the cells buffers. Or, the balls may represent air-dropped food rations and the cells, 
people in a country in famine. 


Solution Let $E_i$ denote the event that cell $i$ is empty ($i = 1,\ldots,n$). Then the $r$ balls
are placed among the remaining $n-1$ cells. For each of the $r$ balls there are $n-1$ cells to
choose from. Hence there are $A(r,n-1) \triangleq (n-1)^r$ ways of arranging the $r$ balls among
the $n-1$ cells. Obviously, since the balls are indistinguishable, not all arrangements will be
distinguishable. Indeed there are only $\binom{n+r-1}{n-1}$ distinguishable distributions and these
are not, typically, equally likely. The total number of ways of distributing the $r$ balls among
the $n$ cells is $n^r$. Hence

$$P[E_i] = (n-1)^r/n^r = \left(1-\frac{1}{n}\right)^r \triangleq P_i.$$

Next assume that cells $i$ and $j$ are empty. Then $A(r,n-2) = (n-2)^r$ and

$$P[E_iE_j] \triangleq P_{ij} = \left(1-\frac{2}{n}\right)^r.$$

In a similar fashion, it is easy to show that $P[E_iE_jE_k] = \left(1-\frac{3}{n}\right)^r \triangleq P_{ijk}$, and so on. Note
that the right-hand side expressions for $P_i$, $P_{ij}$, $P_{ijk}$, and so on do not contain the subscripts
$i$, $ij$, $ijk$, and so on. Thus, each $S_i$ contains $\binom{n}{i}$ identical terms and their sum amounts to

$$S_i = \binom{n}{i}\left(1-\frac{i}{n}\right)^r.$$

Let $E$ denote the event that at least one cell is empty. Then by Theorem 1.5-1,

$$P[E] = P\left[\bigcup_{i=1}^{n} E_i\right] = S_1 - S_2 + \cdots + (-1)^{n+1}S_n.$$

Substituting for $S_i$ from two lines above, we get

$$P[E] = \sum_{i=1}^{n}\binom{n}{i}(-1)^{i+1}\left(1-\frac{i}{n}\right)^r. \tag{1.8-8}$$

The event that all cells are occupied is $E^c$. Hence $P[E^c] = 1 - P[E]$, which can be written as

$$P[E^c] = \sum_{i=0}^{n}\binom{n}{i}(-1)^{i}\left(1-\frac{i}{n}\right)^r. \tag{1.8-9}$$
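
Equation 1.8-9 is easy to check by brute force for small $n$ and $r$. The sketch below evaluates the alternating sum and compares it with a direct enumeration of all $n^r$ equally likely placements, which is the sample space assumed in the solution. The function names and the test case ($r = 6$, $n = 4$) are illustrative choices.

```python
from itertools import product
from math import comb

def prob_all_occupied(r, n):
    """Equation 1.8-9: sum_{i=0}^{n} C(n,i) (-1)^i (1 - i/n)^r."""
    return sum(comb(n, i) * (-1) ** i * (1 - i / n) ** r for i in range(n + 1))

def brute_force(r, n):
    """Enumerate all n**r placements and count those that leave no cell empty."""
    hits = sum(1 for cells in product(range(n), repeat=r) if len(set(cells)) == n)
    return hits / n ** r

print(prob_all_occupied(6, 4), brute_force(6, 4))
```

The enumeration grows as $n^r$, so it is only practical as a check on small cases; the closed form is what one would use for realistic numbers of packets and buffers.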
Example 1.8-6
(m cells empty) Use Equation 1.8-9 to compute the probability that exactly $m$ out of the
$n$ cells are empty after the $r$ balls have been distributed. We denote this probability by the
three-parameter function $P_m(r,n)$.

Solution We write $P[E^c] = P_0(r,n)$. Now assume that exactly $m$ cells are empty and
$n-m$ cells are occupied. Next, let's fix the $m$ cells that are empty, for example, cell numbers
$2,4,5,7,\ldots,l$ ($m$ terms). Let $B(r,n-m)$ be the number of ways of distributing $r$ balls among the
remaining $n-m$ cells such that no cell remains empty and let $A(r,n-m)$ denote the number
of ways of distributing $r$ balls among $n-m$ cells. Then $P_0(r,n-m) = B(r,n-m)/A(r,n-m)$
and, since $A(r,n-m) = (n-m)^r$, we get that $B(r,n-m) = (n-m)^r P_0(r,n-m)$. There are
$\binom{n}{m}$ ways of placing $m$ empty cells among $n$ cells. Hence the total number of arrangements
of $r$ balls among $n$ cells such that $m$ remain empty is $\binom{n}{m}(n-m)^r P_0(r,n-m)$. Finally,
the number of ways of distributing $r$ balls among $n$ cells is $n^r$. Thus,

$$P_m(r,n) = \binom{n}{m}(n-m)^r P_0(r,n-m)/n^r,$$

or, after simplifying,

$$P_m(r,n) = \binom{n}{m}\sum_{i=0}^{n-m}\binom{n-m}{i}(-1)^i\left(1-\frac{m+i}{n}\right)^r. \tag{1.8-10}$$
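
A quick sanity check on Equation 1.8-10 is that the probabilities of $m = 0, 1, \ldots, n$ empty cells must add to one. The sketch below evaluates the formula for a small hypothetical case and prints the resulting distribution; the function name is illustrative.

```python
from math import comb

def prob_m_empty(m, r, n):
    """Equation 1.8-10: probability that exactly m of n cells are empty after r balls."""
    inner = sum(comb(n - m, i) * (-1) ** i * (1 - (m + i) / n) ** r
                for i in range(n - m + 1))
    return comb(n, m) * inner

r, n = 6, 4  # hypothetical small case
dist = [prob_m_empty(m, r, n) for m in range(n + 1)]
print(dist, sum(dist))
```

With $m = 0$ the function reduces to Equation 1.8-9, so the first entry of `dist` is the probability that every cell is occupied.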








1.9 BERNOULLI TRIALS—BINOMIAL AND MULTINOMIAL PROBABILITY LAWS 


Consider the very simple experiment consisting of a single trial with a binary outcome: a
success $\{\zeta = s\}$ with probability $p$, $0 < p < 1$, or a failure $\{\zeta = f\}$ with probability $q = 1-p$.
Thus, $P[s] = p$, $P[f] = q$ and the sample space is $\Omega = \{s, f\}$. The $\sigma$-field of events $\mathscr{F}$ is $\phi$, $\Omega$,
$\{s\}$, $\{f\}$. Such an experiment is called a Bernoulli trial.

Suppose we do the experiment twice. The new sample space $\Omega_2$, written $\Omega_2 = \Omega \times \Omega$,
is the set of all ordered 2-tuples

$$\Omega_2 = \{ss, sf, fs, ff\}.$$

$\mathscr{F}$ contains $2^4 = 16$ events. Some are $\phi$, $\Omega_2$, $\{ss\}$, $\{ss, ff\}$, and so forth.
In the general case of $n$ Bernoulli trials, the Cartesian product sample space becomes

$$\Omega_n = \underbrace{\Omega \times \Omega \times \cdots \times \Omega}_{n \text{ times}}$$

and contains $2^n$ elementary outcomes, each of which is an ordered $n$-tuple. Thus,

$$\Omega_n = \{a_1, \ldots, a_M\},$$

where $M = 2^n$ and $a_i \triangleq z_{i_1} \cdots z_{i_n}$, an ordered $n$-tuple, where $z_{i_j} = s$ or $f$. Since each
outcome $z_{i_j}$ is independent of any other outcome, the joint probability $P[z_{i_1} \cdots z_{i_n}] =
P[z_{i_1}]P[z_{i_2}] \cdots P[z_{i_n}]$. Thus, the probability of a given ordered set of $k$ successes and $n-k$
failures is simply $p^k q^{n-k}$.


Example 1.9-1 
(repeated trials of coin toss) Suppose we throw a coin three times with $p = P[H]$ and
$q = P[T]$. The probability of the event $\{HTH\}$ is $pqp = p^2q$. The probability of the event
$\{THH\}$ is also $p^2q$. The different events leading to two heads and one tail are listed here:

$$E_1 = \{HHT\}, \quad E_2 = \{HTH\}, \quad E_3 = \{THH\}.$$

If $F$ denotes the event of getting two heads and one tail without regard to order, then $F =
E_1 \cup E_2 \cup E_3$. Since $E_iE_j = \phi$ for all $i \neq j$, we obtain $P[F] = P[E_1] + P[E_2] + P[E_3] = 3p^2q$.





Let us now generalize the previous result by considering an experiment consisting of $n$
Bernoulli trials. The sample space $\Omega_n$ contains $M = 2^n$ outcomes $a_1, a_2, \ldots, a_M$, where each
$a_i$ is a string of $n$ symbols, and each symbol represents a success $s$ or a failure $f$. Consider
the event $A_k \triangleq \{k \text{ successes in } n \text{ trials}\}$ and let the primed outcomes, that is, $a_i'$, denote
strings with $k$ successes and $n-k$ failures. Then, with $K$ denoting the number of ordered
arrangements involving $k$ successes and $n-k$ failures, we write

$$A_k = \bigcup_{i=1}^{K}\{a_i'\}.$$

To determine how large $K$ is, we use an artifice similar to that used in proving Equation
1.8-6. Here, let bars represent failures and stars represent successes. Then, as an example,

| * * | * * | | *

represents five successes in nine tries in the order fssfssffs. How many such arrangements
are there? The solution is given by Equation 1.8-6 with $r = k$ and $(n-1)+r$ replaced by
$(n-k)+k = n$. (Note that there is no restriction that the first and last symbols must be
bars.) Thus,

$$K = \binom{n}{k}$$

and, since $\{a_i'\}$ are disjoint, that is, $\{a_i'\} \cap \{a_j'\} = \phi$ for all $i \neq j$, we obtain

$$P[A_k] = P\left[\bigcup_{i=1}^{K}\{a_i'\}\right] = \sum_{i=1}^{K} P[a_i'].$$

Finally, since $P[a_i'] = p^k q^{n-k}$ regardless of the ordering of the s's and f's, we obtain

$$P[A_k] = \binom{n}{k} p^k q^{n-k} \triangleq b(k;n,p). \tag{1.9-1}$$


Binomial probability law. The three-parameter function $b(k;n,p)$ defined in Equation
1.9-1 is called the binomial probability law and is the probability of getting $k$ successes
in $n$ independent tries with individual Bernoulli trial success probability $p$. The binomial
coefficient

$$C_k^n = \binom{n}{k}$$

was introduced in the previous section and is the number of subpopulations of size $k$ that
can be formed from a population of size $n$. In Example 1.9-1 about tossing a coin three
times, the population has size 3 (three tries) and the subpopulation has size 2 (two heads),
and we were interested in getting two heads in three tries with order being irrelevant. Thus,
the correct result is $C_2^3 = 3$. Note that had we asked for the probability of getting two
heads on the first two tosses followed by a tail, that is, $P[E_1]$, we would not have used the
coefficient $C_2^3$ since there is only one way that this event can happen.
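
The distinction between an unordered event and one specific ordering can be made concrete with a few lines of Python. The sketch below implements Equation 1.9-1 directly and evaluates the coin-toss case; the fair-coin value $p = 1/2$ is an assumption made only for illustration, since the example leaves $p$ general.

```python
from math import comb

def binomial_pmf(k, n, p):
    """b(k; n, p) = C(n, k) p^k q^(n-k), Equation 1.9-1."""
    q = 1.0 - p
    return comb(n, k) * p ** k * q ** (n - k)

# Coin-toss example with a fair coin (p = 1/2 is an illustrative assumption)
p = 0.5
two_heads_any_order = binomial_pmf(2, 3, p)   # 3 p^2 q
hht_exactly = p * p * (1 - p)                 # P[E_1], one specific ordering
print(two_heads_any_order, hht_exactly)
```

The first number is three times the second, which is exactly the factor $C_2^3$ contributed by the three orderings.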








Example 1.9-2 
(draw two balls from urn) Suppose $n = 4$; that is, there are four balls numbered 1 to 4 in the
urn. The number of distinguishable, ordered samples of size 2 that can be drawn without
replacement is 12, that is, {1,2}; {1,3}; {1,4}; {2,1}; {2,3}; {2,4}; {3,1}; {3,2}; {3,4};
{4,1}; {4,2}; {4,3}. The number of distinguishable unordered sets is 6, that is,
{1,2}; {1,3}; {1,4}; {2,3}; {2,4}; {3,4}.

From Equation 1.8-2 we obtain this result directly; that is ($n = 4$, $k = 2$)

$$\binom{n}{k} = \frac{4!}{2!\,2!} = 6.$$


Example 1.9-3
(binary pulses) Ten independent binary pulses per second arrive at a receiver. The error
(i.e., a zero received as a one or vice versa) probability is 0.001. What is the probability of
at least one error/second?

Solution

$$P[\text{at least one error/sec}] = 1 - P[\text{no errors/sec}]
= 1 - \binom{10}{0}(0.001)^0(0.999)^{10} = 1 - (0.999)^{10} \approx 0.01.$$








Observation. Note that 


$$\sum_{k=0}^{n} b(k;n,p) = 1. \quad \text{Why?}$$


Example 1.9-4
(odd-man out) An odd number of people want to play a game that requires two teams made 
up of even numbers of players. To decide who shall be left out to act as umpire, each of 
the N persons tosses a fair coin with the following stipulation: If there is one person whose 
outcome (be it heads or tails) is different from the rest of the group, that person will be 
the umpire. Assume that there are 11 players. What is the probability that a player will be 
“odd-man out,” that is, will be the umpire on the first play? 


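Solution Let $E \triangleq \{10H, 1T\}$, where 10H means $H, H, \ldots, H$ ten times, and
$F \triangleq \{10T, 1H\}$. Then $EF = \phi$ and

$$P[E \cup F] = P[E] + P[F]
= \binom{11}{10}\left(\frac{1}{2}\right)^{10}\left(\frac{1}{2}\right) + \binom{11}{1}\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)^{10} \approx 0.0107.$$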


Example 1.9-5
(more odd-man out) In Example 1.9-4 derive a formula for the probability that the odd-man 
out will occur for the first time on the nth play. (Hint: Consider each play as an independent 
Bernoulli trial with success if an odd-man out occurs and failure otherwise.) 


Solution Let E be the event of odd-man out for first time on the nth play. Let F be the 
event of no odd-man out in n — 1 plays and let G be the event of an odd-man out on the 
nth play. Then 


$$E = FG.$$

Since it is completely reasonable to assume $F$ and $G$ are independent events, we can write

$$P[E] = P[F]P[G]$$

$$P[F] = \binom{n-1}{0}(0.0107)^0(0.9893)^{n-1} = (0.9893)^{n-1}$$

$$P[G] = 0.0107.$$

Thus, $P[E] = (0.0107)(0.9893)^{n-1}$, $n \geq 1$, which is often referred to as a geometric
distribution† or law.
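
The odd-man-out probabilities can be reproduced in a few lines. This sketch first computes the single-play probability for 11 players and then the geometric law for the first occurrence on play $n$; the function name is illustrative.

```python
from math import comb

N = 11                                   # number of players in the odd-man-out example
p = 2 * comb(N, 1) * 0.5 ** N            # P[odd-man out on a single play]

def first_odd_man_out(n, p=p):
    """P[E]: no odd-man out in the first n-1 plays, then one on play n."""
    return p * (1.0 - p) ** (n - 1)

print(round(p, 4))
print([round(first_odd_man_out(n), 4) for n in (1, 2, 10)])
```

Because the success probability per play is only about 1 percent, the geometric law decays slowly and many plays may be needed before an umpire is chosen.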
Example 1.9-6 
(multiple lottery tickets) If you want to buy 50 lottery tickets, should you purchase them 
all in one lottery, or should you buy single tickets in 50 similar lotteries? For simplicity we 
take the case of a 100 ticket lottery with ticket prices of $1 each, and 50 such independent 
lotteries are available. Consider first the case of buying 50 tickets from one such lottery. Let 
E; denote the event that the ith ticket is the winning ticket. Since any ticket is as likely to 
be the winning ticket as any other ticket, and not more than one ticket can be a winner, 
we have by classical probability that $P[E_i] = n_{\text{win}}/n_{\text{tot}} = 1/100$ for $i = 1,\ldots,100$. The
event of winning the lottery is that one of the 50 purchased tickets is the winning ticket or,
equivalently, with $E$ denoting the event that one of the 50 tickets is the winner, $E = \bigcup_{i=1}^{50} E_i$
and $P[E] = P\left[\bigcup_{i=1}^{50} E_i\right] = \sum_{i=1}^{50} P[E_i] = 50 \times 1/100 = 0.5$. Next we consider the case of
buying 1 ticket in each of 50 separate lotteries. We recognize this as Bernoulli trials with 
an individual success probability p = 0.01 and q = 0.99. With the aid of a calculator, we 
can find the probability of winning (exactly) once as 


$$P[\text{win once}] = b(1; 50, 0.01) = \binom{50}{1}(0.01)^1(0.99)^{49} = 50 \times 10^{-2} \times 0.611 \doteq 0.306^{\ddagger}.$$

Similarly, we find the probability of winning twice $b(2;50,0.01) \doteq 0.076$, the probability
of winning three times $b(3;50,0.01) \doteq 0.012$, the probability of winning four times
$b(4;50,0.01) \doteq 0.001$, and the probability of winning more times is negligible. As a check
we can easily calculate the probability of winning at least once,

$$P[\text{win at least once}] = 1 - P[\text{lose every time}] = 1 - q^{50} = 1 - (0.99)^{50} \doteq 0.395.$$

Indeed we have $0.395 = 0.306 + 0.076 + 0.012 + 0.001$. We thus find that, if your only concern
is to win at least once, it is better to buy all 50 tickets from one lottery. On the other hand,
when playing in separate lotteries, there is the possibility of winning multiple times. So your
average winnings may be more of a concern. Assuming a fair lottery with payoff \$100, we
can calculate an average winnings as

$$100 \times 0.306 + 200 \times 0.076 + 300 \times 0.012 + 400 \times 0.001 = 49.8.$$

So, in terms of average winnings, it is about the same either way.

† A popular variant on this definition is the alternative geometric distribution given as $pq^n$, $n \geq 0$, with
$q = 1-p$ and $0 < p < 1$.

‡ We use the notation $\doteq$ to indicate that all the decimal digits shown are correct.
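
The numbers in the lottery comparison can be regenerated without a calculator. The sketch below evaluates the binomial terms for one ticket in each of 50 lotteries, the probability of at least one win, and the average winnings summed over all possible numbers of wins; the variable names are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """b(k; n, p) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n, p, payoff = 50, 0.01, 100.0

win_at_least_once = 1.0 - (1.0 - p) ** n
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
average_winnings = sum(payoff * k * pk for k, pk in enumerate(pmf))

print(round(pmf[1], 3), round(win_at_least_once, 3), round(average_winnings, 2))
```

Summing over every possible number of wins gives an average of payoff times $np$, that is, \$50; the 49.8 obtained above is slightly lower only because the hand calculation uses rounded terms and stops at four wins. Buying all 50 tickets in one lottery also averages $0.5 \times 100 = 50$ dollars.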








Further discussion of the binomial law. We write down some formulas for further use. 
The probability $B(k;n,p)$ of $k$ or fewer successes in $n$ tries is given by

$$B(k;n,p) = \sum_{i=0}^{k} b(i;n,p) = \sum_{i=0}^{k}\binom{n}{i}p^i q^{n-i}. \tag{1.9-2}$$

The symbol $B(k;n,p)$ is called the binomial distribution function. The probability of $k$ or
more successes in $n$ tries is

$$\sum_{i=k}^{n} b(i;n,p) = 1 - B(k-1;n,p).$$

The probability of more than $k$ successes but no more than $j$ successes is

$$\sum_{i=k+1}^{j} b(i;n,p) = B(j;n,p) - B(k;n,p).$$

There will be much more on distribution functions in later chapters. We illustrate the
application of this formula in Example 1.9-7.





Example 1.9-7 

(missile attack) Five missiles are fired against an aircraft carrier in the ocean. It takes at
least two direct hits to sink the carrier. All five missiles are on the correct trajectory but
must get through the "point-defense" guns of the carrier. It is known that the point-defense
guns can destroy a missile with probability $p = 0.9$. What is the probability that the carrier
will still be afloat after the encounter?


Solution Let E be the event that the carrier is still afloat and let F be the event of a 
missile getting through the point-defense guns. Then 


$$P[F] = 0.1$$

and

$$P[E] = 1 - P[E^c] = 1 - \sum_{i=2}^{5}\binom{5}{i}(0.1)^i(0.9)^{5-i} \approx 0.92.$$
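
The missile calculation is a direct use of the binomial distribution function. As a minimal sketch, the code below implements $B(k;n,p)$ as a plain sum and evaluates the probability of at most one hit out of five; the function and variable names are illustrative.

```python
from math import comb

def binomial_cdf(k, n, p):
    """B(k; n, p): probability of k or fewer successes in n tries (Equation 1.9-2)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))

p_hit = 0.1                            # a missile gets through the point-defense guns
afloat = binomial_cdf(1, 5, p_hit)     # at most one hit: carrier stays afloat
sunk = 1.0 - afloat                    # two or more hits
print(round(afloat, 4), round(sunk, 4))
```

Computing the two-term sum for "at most one hit" is equivalent to, and shorter than, subtracting the four-term sum for "two or more hits" from one.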





Multinomial Probability Law 


The multinomial probability law is a generalization of the binomial law. The binomial law
is based on Bernoulli trials in which only two outcomes are possible. The multinomial law
is based on a generalized Bernoulli trial in which $l$ outcomes are possible. Thus, consider an
elementary experiment consisting of a single trial with $l$ elementary outcomes $\zeta_1, \zeta_2, \ldots, \zeta_l$.
Let the probability of outcome $\zeta_i$ be $p_i$ ($i = 1,\ldots,l$). Then

$$p_i \geq 0, \quad \text{and} \quad \sum_{i=1}^{l} p_i = 1. \tag{1.9-3}$$

Assume that this generalized Bernoulli trial is repeated $n$ times and consider the event
consisting of a prescribed, ordered string of elementary outcomes in which $\zeta_1$ appears $r_1$
times, $\zeta_2$ appears $r_2$ times, and so on until $\zeta_l$ appears $r_l$ times. What is the probability of
this event? The key here is that the order is prescribed a priori. For example, with $l = 3$
(three possible outcomes) and $n = 6$ (six tries), a prescribed string might be $\zeta_1\zeta_3\zeta_2\zeta_2\zeta_1\zeta_2$
so that $r_1 = 2$, $r_2 = 3$, $r_3 = 1$. Observe that $\sum_{i=1}^{l} r_i = n$. Since the outcome of each
trial is an independent event, the probability of observing a prescribed ordered string is
$p_1^{r_1} p_2^{r_2} \cdots p_l^{r_l}$. Thus, for the string $\zeta_1\zeta_3\zeta_2\zeta_2\zeta_1\zeta_2$ the probability is $p_1^2 p_2^3 p_3$.

A different (greater) probability results when order is not specified. Suppose we perform
$n$ repetitions of a generalized Bernoulli trial and consider the event in which $\zeta_1$ appears
$r_1$ times, $\zeta_2$ appears $r_2$ times, and so forth, without regard to order. Before computing the
probability of this event we furnish an example.








Example 1.9-8 
(busy emergency number) In calling the Sav-Yur-Life health care facility to report an emer- 
gency, one of three things can happen: 





(1) the line is busy (event $E_1$);
(2) you get the wrong number (event $E_2$); and
(3) you get through to the triage nurse (event $E_3$).

Assume $P[E_i] = p_i$. What is the probability that in five separate emergencies at different
times, initial calls are met with four busy signals and one wrong number?

Solution Let $F$ denote the event of getting four busy signals and one wrong number.
Then

$$F = F_1 \cup F_2 \cup F_3 \cup F_4 \cup F_5,$$

where $F_1 = \{E_1E_1E_1E_1E_2\}$, $F_2 = \{E_1E_1E_1E_2E_1\}$, $F_3 = \{E_1E_1E_2E_1E_1\}$, $F_4 = \{E_1E_2E_1E_1E_1\}$,
and $F_5 = \{E_2E_1E_1E_1E_1\}$.

Since $F_iF_j = \phi$, $P[F] = \sum_{i=1}^{5} P[F_i]$. But $P[F_i] = p_1^4 p_2^1 p_3^0$ independent of $i$. Hence

$$P[F] = 5 p_1^4 p_2^1 p_3^0.$$

With the assumed $p_1 = 0.3$, $p_2 = 0.1$, $p_3 = 0.6$, we get

$$P[F] = 5 \times 8.1 \times 10^{-3} \times 0.1 \times 1 \approx 0.004.$$





In problems of this type we must count all the strings of length $n$ in which $\zeta_1$ appears
$r_1$ times, $\zeta_2$ appears $r_2$ times, and so on. In the example just considered, there were five
such strings. In the general case of $n$ trials with $r_1$ outcomes of $\zeta_1$, $r_2$ outcomes of $\zeta_2$, and
so on, there are

$$\frac{n!}{r_1!\,r_2! \cdots r_l!} \tag{1.9-4}$$

such strings. In Example 1.9-8, $n = 5$, $r_1 = 4$, $r_2 = 1$, $r_3 = 0$ so that

$$\frac{5!}{4!\,1!\,0!} = 5.$$

The number in Equation 1.9-4 is recognized as the multinomial coefficient. To check that it
is the appropriate coefficient, consider the $r_1$ outcomes $\zeta_1$. The number of ways of placing
the $r_1$ outcomes $\zeta_1$ among the $n$ trials is identical with the number of subpopulations of
size $r_1$ in a population of size $n$ which is $\binom{n}{r_1}$. That leaves $n - r_1$ trials among which we
wish to place $r_2$ outcomes $\zeta_2$. The number of ways of doing that is $\binom{n-r_1}{r_2}$. Repeating
this process we obtain the total number of distinguishable arrangements

$$\binom{n}{r_1}\binom{n-r_1}{r_2}\cdots\binom{n-r_1-r_2-\cdots-r_{l-1}}{r_l} = \frac{n!}{r_1!\,r_2! \cdots r_l!}.$$


Example 1.9-9 
(repeated generalized Bernoulli) Consider four repetitions of a generalized Bernoulli experiment
in which the outcomes are $\ast$, $\bullet$, $\circ$. What is the number of ways of getting two $\ast$, one $\bullet$,
and one $\circ$?

Solution The number of ways of getting two $\ast$ in four trials is $\binom{4}{2} = 6$. If we let the
spaces between bars represent a trial, then we can denote the outcomes as

| * | * |   |   |
| * |   | * |   |
| * |   |   | * |
|   | * | * |   |
|   | * |   | * |
|   |   | * | * |

The number of ways of placing $\bullet$ among the two remaining cells is $\binom{2}{1} = 2$. The number
of ways of placing $\circ$ among the remaining cell is $\binom{1}{1} = 1$. Hence the total number of
arrangements is $6 \cdot 2 \cdot 1 = 12$. They are

$\ast\ast\bullet\circ$, $\ast\ast\circ\bullet$, $\ast\bullet\ast\circ$, $\ast\circ\ast\bullet$, $\ast\bullet\circ\ast$, $\ast\circ\bullet\ast$,
$\bullet\ast\ast\circ$, $\circ\ast\ast\bullet$, $\bullet\ast\circ\ast$, $\circ\ast\bullet\ast$, $\bullet\circ\ast\ast$, $\circ\bullet\ast\ast$.










We can now state the multinomial probability law. Consider a generalized Bernoulli trial
with outcomes $\zeta_1, \zeta_2, \ldots, \zeta_l$ and let the probability of observing outcome $\zeta_i$ be $p_i$, $i =
1,\ldots,l$, where $p_i \geq 0$ and $\sum_{i=1}^{l} p_i = 1$. The probability that in $n$ trials $\zeta_1$ occurs $r_1$ times,
$\zeta_2$ occurs $r_2$ times, and so on is

$$P(\mathbf{r}; n, \mathbf{p}) = \frac{n!}{r_1!\,r_2! \cdots r_l!}\,p_1^{r_1} p_2^{r_2} \cdots p_l^{r_l}, \tag{1.9-5}$$

where $\mathbf{r}$ and $\mathbf{p}$ are $l$-tuples defined by

$$\mathbf{r} = (r_1, r_2, \ldots, r_l), \quad \mathbf{p} = (p_1, p_2, \ldots, p_l), \quad \text{and} \quad \sum_{i=1}^{l} r_i = n.$$

Observation. With $l = 2$, Equation 1.9-5 becomes the binomial law with $p_1 \triangleq p$,
$p_2 \triangleq 1 - p$, $r_1 \triangleq k$, and $r_2 \triangleq n - k$. Functions such as Equations 1.9-1 and 1.9-5 are
often called probability mass functions.


Example 1.9-10 
(emergency calls) In the United States, 911 is the all-purpose number used to summon an 
ambulance, the police, or the fire department. In the rowdy city of Nirvana in upstate New 
York, it has been found that 60 percent of all calls request the police, 25 percent request 
an ambulance, and 15 percent request the fire department. We observe the next ten calls. 
What is the probability of the combined event that six calls will ask for the police, three 
for ambulances, and one for the fire department?

Solution Using Equation 1.9-5 we get

$$P(6, 3, 1; 10, 0.6, 0.25, 0.15) = \frac{10!}{6!\,3!\,1!}(0.6)^6(0.25)^3(0.15)^1 \approx 0.092.$$





A numerical problem appears if $n$ gets large. For example, suppose we observe 100 calls and
consider the event of 60 calls for the police, 30 for ambulances, and 10 for the fire department;
clearly computing numbers such as 100!, 60!, 30! requires some care. An important result
that helps in evaluating such large factorials is Stirling's† formula:

$$n! \approx (2\pi)^{1/2}\, n^{n+1/2}\, e^{-n},$$

where the approximation improves as $n$ increases, for example,

n      n!           Stirling's formula    Percent error
1      1            0.922137              8
10     3,628,800    3,598,700             0.8
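
The table entries can be regenerated, and extended to other values of $n$, with a short script. This is a minimal sketch; the function name and the extra value $n = 20$ are illustrative choices.

```python
import math

def stirling(n):
    """Stirling's approximation sqrt(2*pi) * n^(n + 1/2) * e^(-n)."""
    return math.sqrt(2.0 * math.pi) * n ** (n + 0.5) * math.exp(-n)

for n in (1, 10, 20):
    exact = math.factorial(n)
    approx = stirling(n)
    error = 100.0 * (exact - approx) / exact
    print(n, exact, f"{approx:.6g}", f"{error:.2f}%")
```

The percent error falls roughly in proportion to $1/n$, consistent with the two rows shown in the table.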


When using a numerical computer to evaluate Equation 1.9-5, additional care must be used 
to avoid loss of accuracy due to under- and over-flow. A joint evaluation of pairs of large 
and small numbers can help in this regard, as can the use of logarithms. 
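
One common way to apply the logarithm idea is to accumulate the log of each factor and exponentiate only at the end. The sketch below does this for Equation 1.9-5 using the log-gamma function, with $\log n! = \texttt{lgamma}(n+1)$. It is one possible approach under that assumption, not a method prescribed by the text, and the function name is illustrative.

```python
import math

def multinomial_pmf(counts, probs):
    """Equation 1.9-5 evaluated in the log domain to avoid overflow and underflow."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)                 # log n!
    for r, p in zip(counts, probs):
        log_p -= math.lgamma(r + 1)            # subtract log r!
        if r > 0:
            log_p += r * math.log(p)           # a zero count contributes p^0 = 1
    return math.exp(log_p)

probs = (0.6, 0.25, 0.15)                      # police, ambulance, fire department
print(multinomial_pmf((6, 3, 1), probs))       # the ten-call example
print(multinomial_pmf((60, 30, 10), probs))    # the 100-call case
```

For the 100-call case no intermediate quantity comes anywhere near 100!, so ordinary double precision is sufficient.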

As stated earlier, the binomial law is a special case, perhaps the most important case, 
of the multinomial law. When the parameters of the binomial law attain extreme values, the 
binomial law can be used to generate another important probability law. This is explored 
next. 


1.10 ASYMPTOTIC BEHAVIOR OF THE BINOMIAL LAW: THE POISSON LAW


Suppose that in the binomial function $b(k;n,p)$, $n \gg 1$, $p \ll 1$, but $np$ remains constant,
say $np = \mu$. Recall that $q = 1 - p$. Hence

$$\binom{n}{k} p^k (1-p)^{n-k} \simeq \frac{1}{k!}\,\mu^k\left(1-\frac{\mu}{n}\right)^{n-k},$$

where $n(n-1)\cdots(n-k+1) \simeq n^k$ if $n$ is allowed to become large enough and $k$ is held
fixed. Hence in the limit as $n \to \infty$, $p \to 0$, and $k \ll n$, we obtain

$$b(k;n,p) \simeq \frac{1}{k!}\,\mu^k\left(1-\frac{\mu}{n}\right)^{n-k} \to \frac{\mu^k}{k!}\,e^{-\mu}. \tag{1.10-1}$$

† James Stirling, eighteenth-century mathematician.

Thus, in situations where the binomial law applies with $n \gg 1$, $p \ll 1$ but $np = \mu$ is a
finite constant, we can use the approximation

$$b(k;n,p) \simeq \frac{\mu^k}{k!}\,e^{-\mu}. \tag{1.10-2}$$

Poisson law. The Poisson probability law, with parameter $\mu\ (> 0)$, is given as

$$p(k) = \frac{\mu^k}{k!}\,e^{-\mu}, \quad 0 \leq k < \infty.$$

Unlike the binomial law, the Poisson law just has one parameter $\mu$ that can take on any
positive value.


Example 1.10-1 
(time to failure) A computer contains 10,000 components. Each component fails independently
from the others and the yearly failure probability per component is $10^{-4}$. What is
the probability that the computer will be working one year after turn-on? Assume that the
computer fails if one or more components fail.

Solution

$$p = 10^{-4}, \quad n = 10{,}000, \quad k = 0, \quad np = 1.$$

Hence

$$b(0; 10{,}000, 10^{-4}) \simeq \frac{1^0}{0!}\,e^{-1} = \frac{1}{e} = 0.368.$$
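
The quality of the Poisson approximation in this example can be checked by evaluating the binomial law exactly. The sketch below compares the two for $k = 0$ and for a few additional values of $k$ with the same $n$ and $p$; the function names are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """b(k; n, p) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(k, mu):
    """Poisson law: mu^k e^(-mu) / k!."""
    return mu ** k * exp(-mu) / factorial(k)

n, p = 10_000, 1e-4                      # values from the time-to-failure example
print(binomial_pmf(0, n, p), poisson_pmf(0, n * p))

for k in range(4):                       # a few more terms with the same n and p
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```

With $n$ this large and $p$ this small, the two laws agree to several decimal places, which is why the one-parameter Poisson form is usually preferred in such problems.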


Example 1.10-2
(random points in time) Suppose that $n$ independent points are placed at random in an
interval $(0, T)$. Let $0 < t_1 < t_2 < T$ and $t_2 - t_1 \triangleq \tau$. Let $\tau/T \ll 1$ and $n \gg 1$. What is
the probability of observing exactly $k$ points in $\tau$ seconds? (Figure 1.10-1.)

Solution Consider a single point placed at random in $(0,T)$. The probability of the point
appearing in $\tau$ is $\tau/T$. Let $p = \tau/T$. Every other point has the same probability of being in
$\tau$ seconds. Hence, the probability of finding $k$ points in $\tau$ seconds is the binomial law

$$P[k \text{ points in } \tau \text{ sec}] = \binom{n}{k} p^k q^{n-k}. \tag{1.10-3}$$

With $n \gg 1$, we use the approximation in Equation 1.10-1 to give

$$b(k;n,p) \simeq \left(\frac{n\tau}{T}\right)^k \frac{e^{-(n\tau/T)}}{k!}, \tag{1.10-4}$$

where $n/T$ can be interpreted as the "average" number of points per unit interval.





Figure 1.10-1 Points placed at random on a line. Each point is placed with equal likelihood anywhere
along the line.

Replacing the average rate in this example with parameter $\mu$ ($\mu > 0$), we get the Poisson
law defined by

$$P[k \text{ points}] = e^{-\mu}\frac{\mu^k}{k!}, \tag{1.10-5}$$

where $k = 0,1,2,\ldots$. With $\mu \triangleq \lambda\tau$, where $\lambda$ is the average number of points† per unit time
and $\tau$ is the length of the interval $(t, t+\tau]$, the probability of $k$ points in elapsed time $\tau$ is

$$P(k; t, t+\tau) = e^{-\lambda\tau}\frac{(\lambda\tau)^k}{k!}. \tag{1.10-6}$$

For the Poisson law, we also stipulate that numbers of points arriving in disjoint time intervals
constitute independent events. We can regard this as inherited from an underlying set
of Bernoulli trials, which are always independent.

In Equation 1.10-6 we assume that $\lambda$ is a constant and not a function of $t$. If $\lambda$ varies
with $t$, we can generalize $\lambda\tau$ with the integral $\int_t^{t+\tau}\lambda(u)\,du$, and the probability of $k$ points
in the interval $(t, t+\tau]$ becomes

$$P(k; t, t+\tau) = \exp\left[-\int_t^{t+\tau}\lambda(u)\,du\right]\frac{1}{k!}\left[\int_t^{t+\tau}\lambda(u)\,du\right]^k. \tag{1.10-7}$$


The Poisson law $P[k \text{ events in } \Delta x]$, or more generally $P[k \text{ events in } (x, x+\Delta x)]$, where $x$
is time, volume, distance, and so forth and $\Delta x$ is the interval associated with $x$, is widely
used in engineering and sciences. Some typical situations in various fields where the Poisson
law is applied are listed below.


Physics. In radioactive decay: $P[k\ \alpha\text{-particles in } \tau \text{ seconds}]$ with $\lambda$ the average
number of emitted $\alpha$-particles per second.

Engineering. In planning the size of a call center: $P[k \text{ telephone calls in } \tau
\text{ seconds}]$ with $\lambda$ the average number of calls per second.

Biology. In water pollution monitoring: $P[k$ coliform bacteria in 1000 cubic centimeters$]$
with $\lambda$ the average number of coliform bacteria per cubic centimeter.

Transportation. In planning the size of a highway toll facility: $P[k \text{ automobiles
arriving in } \tau \text{ minutes}]$ with $\lambda$ the average number of automobiles per minute.

Optics. In designing an optical receiver: $P[k$ photons per second over a surface
of area $A]$ with $\lambda$ the average number of photons-per-second per unit area.

Communications. In designing a fiber optical transmitter-receiver link: $P[k$ photoelectrons
generated at the receiver in one second$]$ with $\lambda$ the average number of photoelectrons
per second.


† The term points here is a generic term. Equally appropriate would be "arrivals," "hits," "occurrences,"
etc.

The parameter $\lambda$ is often called, in this context, the Poisson rate parameter. Its dimensions
are points per unit interval, the interval being time, distance, volume, and so forth.
When the form of the Poisson law that we wish to use is as in Equation 1.10-6 or 1.10-7,
we speak of the Poisson law with rate parameter $\lambda$ or rate function $\lambda(t)$.
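
To illustrate the difference between a rate parameter and a rate function, the sketch below evaluates Equation 1.10-7 with the integral approximated numerically by the midpoint rule. The two rate functions, the number of integration steps, and the function name are all hypothetical choices made for illustration.

```python
from math import exp, factorial

def poisson_count_pmf(k, rate, t, tau, steps=10_000):
    """P(k; t, t+tau) for a rate function rate(u), following Equation 1.10-7.

    The integral of rate(u) over (t, t+tau] is approximated with the midpoint rule.
    """
    h = tau / steps
    mu = sum(rate(t + (i + 0.5) * h) for i in range(steps)) * h
    return exp(-mu) * mu ** k / factorial(k)

constant = lambda u: 2.0                  # hypothetical: 2 points per unit time
ramp = lambda u: 2.0 * u                  # hypothetical time-varying rate

print(poisson_count_pmf(0, constant, 0.0, 1.5))   # constant rate: same as Equation 1.10-6
print(poisson_count_pmf(3, ramp, 0.0, 2.0))       # integral of 2u over (0, 2] is 4
```

With a constant rate the integral is just $\lambda\tau$, so the first call reproduces Equation 1.10-6 with $\lambda\tau = 3$.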


Example 1.10-3
(misuse of probability) (a) “Prove” that there must be life in the universe, other than that
on our own Earth, by using the following numbers: average number of stars per galaxy,
$300 \times 10^9$; number of galaxies, $100 \times 10^9$; probability that a star has a planetary system, 0.5;
average number of planets per planetary system, 9; probability that a planet can sustain
life, 1/9; probability, $p$, of life emerging on a life-sustaining planet, $10^{-12}$.





Solution First we compute $n_{ls}$, the number of planets that are life-sustaining:

$$n_{ls} = 300 \times 10^9 \times 100 \times 10^9 \times 0.5 \times 9 \times 1/9 = 1.5 \times 10^{22}.$$

Next we use the Poisson approximation to the binomial with $a = n_{ls}\,p = 1.5 \times 10^{22} \times 10^{-12} = 1.5 \times 10^{10}$,
for computing the probability of no life outside of Earth’s and obtain

$$b(0, 1.5 \times 10^{22}, 10^{-12}) \approx \frac{(1.5 \times 10^{10})^0}{0!}\, e^{-1.5 \times 10^{10}} \approx 0.$$

Hence we have just “shown” that life outside Earth has a probability of
unity, that is, a sure bet. Note that the number for life emerging on other planets, $10^{-12}$,
is impressively low.

(b) Now show that life outside Earth is extremely unlikely by using the same set of
numbers except that the probability of life emerging on a life-sustaining planet has been
reduced to $10^{-30}$.

Solution Using the Poisson approximation to the binomial, with $a = 1.5 \times 10^{22} \times 10^{-30} = 1.5 \times 10^{-8}$,
we obtain for the probability of no life outside Earth’s:

$$b(0, 1.5 \times 10^{22}, 10^{-30}) \approx \frac{(1.5 \times 10^{-8})^0}{0!}\, e^{-1.5 \times 10^{-8}} \approx 1 - (1.5 \times 10^{-8}) \approx 1,$$

where we have used the approximation $e^{-x} \approx 1 - x$ for small $x$.

Thus, by changing only one number, we have gone from “proving” that the universe 
contains extraterrestrial life to proving that, outside of ourselves, the universe is lifeless. 
The reason that this is a misuse of probability is that, at present, we have no idea as to the 
factors that lead to the emergence of life from nonliving material. While the calculation is 
technically correct, this example illustrates the use of contrived numbers to either prove or 
disprove what is essentially a belief or faith. 
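
The arithmetic of both parts can be reproduced with a few lines of Python. This is only a sketch of the calculation in the example; the helper name `prob_no_life` is hypothetical, and `math.expm1` is used here (an implementation choice, not something from the text) so that $1 - e^{-a}$ stays accurate when $a$ is tiny.

```python
import math

# Number of life-sustaining planets from the figures quoted in the example.
n_ls = 300e9 * 100e9 * 0.5 * 9 * (1 / 9)   # about 1.5e22

def prob_no_life(p):
    """Poisson approximation to b(0, n_ls, p), namely exp(-n_ls * p)."""
    return math.exp(-n_ls * p)

for p in (1e-12, 1e-30):
    a = n_ls * p
    # -expm1(-a) equals 1 - exp(-a), computed without cancellation for tiny a.
    print(f"p = {p:g}: a = {a:.3g}, P[no life] = {prob_no_life(p):.6g}, "
          f"P[some life] = {-math.expm1(-a):.6g}")
```

In double precision the first case underflows to exactly zero and the second differs from one by about $1.5 \times 10^{-8}$, which mirrors how completely the single parameter $p$ controls the conclusion.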


All the numbers have been quoted at various times by proponents of the idea of extraterrestrial life. 







Example 1.10-4
(website server) A website server receives on the average 16 access requests per minute. If
the server can handle at most 24 accesses per minute, what is the probability that in any
one minute the website will saturate?

Solution Saturation occurs if the number of requests in a minute exceeds 24. The probability
of this event is

$$P[\text{saturation}] = \sum_{k=25}^{\infty} [\lambda\tau]^k \frac{e^{-\lambda\tau}}{k!} \tag{1.10-8}$$

$$= \sum_{k=25}^{\infty} [16]^k \frac{e^{-16}}{k!} \approx 0.017 \approx 1/60. \tag{1.10-9}$$

Thus, about once in every 60 minutes (on the “average”) will a visitor be turned away.
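
The sum in Equation 1.10-9 can be checked numerically. The sketch below is not from the text: the helper names are hypothetical, the infinite sum is evaluated as the complement of the finite sum over $k = 0, \ldots, 24$, and a Normal approximation with a continuity correction at 24.5 is included for comparison. Direct summation should come out near 0.022, somewhat above the quoted 0.017, while the Normal approximation lands close to 0.017.

```python
import math

def poisson_pmf(k, mean):
    """e^{-mean} mean^k / k!, evaluated in logs to avoid overflow for large k."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def poisson_tail(k_min, mean):
    """P[X >= k_min], via the complement of the finite sum over 0..k_min-1."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(k_min))

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mean = 16.0   # lambda * tau: 16 requests per minute over a one-minute window
exact = poisson_tail(25, mean)
# Normal approximation with a continuity correction at 24.5.
approx = 1.0 - std_normal_cdf((24.5 - mean) / math.sqrt(mean))
print(f"direct sum: {exact:.4f}, Normal approximation: {approx:.4f}")
```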





Given the numerous applications of the Poisson law in engineering and the sciences, 
one would think that its origin is of somewhat more noble birth than “merely” as a limiting 
form of the binomial law. Indeed this is the case, and the Poisson law can be derived once 
three assumptions are made. Obviously these three assumptions should reasonably mirror 
the characteristics of the underlying physical process; otherwise our results will be of only 
marginal interest. Fortunately, in many situations these assumptions seem to be quite valid. 

In order to be concrete, we shall talk about occurrences taking place in time (as opposed 
to, say, length or distance). The Poisson law is based on the following three assumptions: 


1. The probability, $P(1; t, t+\Delta t)$, of a single event occurring in $(t, t+\Delta t]$ is proportional
to $\Delta t$, that is,

$$P(1; t, t+\Delta t) = \lambda(t)\,\Delta t, \qquad \Delta t \to 0. \tag{1.10-10}$$

In Equation 1.10-10, $\lambda(t)$ is the Poisson rate parameter.
2. The probability of $k$ ($k > 1$) events in $(t, t+\Delta t]$ goes to zero:

$$P(k; t, t+\Delta t) = 0, \qquad \Delta t \to 0, \quad k = 2, 3, \ldots. \tag{1.10-11}$$

3. Events in nonoverlapping time intervals are statistically independent.†
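
One informal way to see why these three assumptions point toward the Poisson law (this is a numerical illustration, not the derivation deferred to Chapter 9) is to chop an interval of length $\tau$ into $n$ short subintervals. The sketch below assumes a constant rate; the values of `rate`, `tau`, and `k` are hypothetical and chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean**k / math.factorial(k)

rate, tau, k = 4.0, 1.5, 3        # hypothetical constant rate, window, and count
for n in (10, 100, 10_000):
    dt = tau / n                  # width of each short subinterval
    # Assumptions 1 and 2: each subinterval holds one event with probability
    # rate * dt and (approximately) never more than one.
    # Assumption 3: subintervals are independent, so the total count is binomial.
    print(n, binomial_pmf(k, n, rate * dt), poisson_pmf(k, rate * tau))
```

As $n$ grows, the binomial column should settle onto the Poisson value, in line with the limiting argument of this section.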


Starting with these three simple physical assumptions, it is a straightforward task to
obtain the Poisson probability law. We leave this derivation to Chapter 9 but merely point
out that the clever use of the assumptions leads to a set of elementary, first-order differential
equations whose solution is the Poisson law. The general solution is furnished by
Equation 1.10-7 but, fortunately, in a large number of physical situations the Poisson rate
parameter $\lambda(t)$ can be approximated by a constant, say, $\lambda$. In that case Equation 1.10-6 can
be applied. We conclude this section with a final example.

†Note that in property 3 we are talking about disjoint time intervals, not disjoint events. For disjoint events
we would add probabilities, but for disjoint time intervals, which lead to independent events in the Poisson
law, we multiply the individual probabilities.


Example 1.10-5
(defects in digital tape) A manufacturer of computer tape finds that the defect density along
the length of tape is not uniform. After a careful compilation of data, it is found that for
tape strips of length $D$, the defect density $\lambda(x)$ along the tape length $x$ varies as

$$\lambda(x) = \lambda_0 + \frac{1}{2}(\lambda_1 - \lambda_0)\left(1 + \cos\left[\frac{2\pi x}{D}\right]\right), \qquad \lambda_1 > \lambda_0,$$

for $0 \le x \le D$ due to greater tape contamination at the edges $x = 0$ and $x = D$.

(a) What is the meaning of $\lambda(x)$ in this case?

(b) What is the average number of defects for a tape strip of length $D$?

(c) What is an expression for the probability of $k$ defects on a tape strip of length $D$?

(d) What are the Poisson assumptions in this case?


Solution

(a) Bearing in mind that $\lambda(x)$ is a defect density, that is, the average number of defects
per unit length at $x$, we conclude that $\lambda(x)\Delta x$ is the average number of defects in
the tape from $x$ to $x + \Delta x$.

(b) Given the definition of $\lambda(x)$, we conclude that the average number of defects along
the whole tape is merely the integral of $\lambda(x)$, that is,

$$\int_0^D \lambda(x)\,dx = \int_0^D \left[\lambda_0 + \frac{1}{2}(\lambda_1 - \lambda_0)\left(1 + \cos\left[\frac{2\pi x}{D}\right]\right)\right] dx = \frac{\lambda_0 + \lambda_1}{2}\,D \triangleq \Lambda.$$

(c) Assuming the Poisson law holds, we use Equation 1.10-7 with $x$ and $\Delta x$ (distances)
replacing $t$ and $\tau$ (times). Thus,

$$P(k; x, x+\Delta x) = \exp\left[-\int_x^{x+\Delta x} \lambda(\xi)\,d\xi\right] \cdot \frac{1}{k!}\left[\int_x^{x+\Delta x} \lambda(\xi)\,d\xi\right]^k.$$

In particular, with $x = 0$ and $x + \Delta x = D$, we obtain

$$P(k; 0, D) = \frac{\Lambda^k}{k!}\,e^{-\Lambda},$$

where $\Lambda$ is as defined above.

(d) The Poisson assumptions become

(i) $P[1; x, x+\Delta x] \approx \lambda(x)\Delta x$, as $\Delta x \to 0$.

(ii) $P[k; x, x+\Delta x] = 0$, $\Delta x \to 0$, for $k = 2, 3, \ldots$; that is, the probability
of there being more than one defect in the interval $(x, x+\Delta x)$ as $\Delta x$
becomes vanishingly small is zero.

(iii) the occurrences of defects (events) in nonoverlapping sections of the tape
are independent.
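
The results of parts (b) and (c) can be checked numerically. The sketch below assumes the defect density has the raised-cosine form with argument $2\pi x/D$, which places the maxima at both edges as the example describes; the numerical values of `lam0`, `lam1`, and `D` are hypothetical, as are the function names.

```python
import math

def defect_density(x, lam0, lam1, D):
    """lambda(x): equal to lam1 at the edges x = 0 and x = D, lam0 at x = D/2."""
    return lam0 + 0.5 * (lam1 - lam0) * (1.0 + math.cos(2.0 * math.pi * x / D))

def mean_defects(lam0, lam1, D, steps=100_000):
    """Midpoint-rule integral of lambda(x) over [0, D]."""
    h = D / steps
    return h * sum(defect_density((i + 0.5) * h, lam0, lam1, D) for i in range(steps))

def prob_k_defects(k, lam0, lam1, D):
    """P(k; 0, D) = Lambda^k e^{-Lambda} / k! with Lambda = (lam0 + lam1) D / 2."""
    big_lambda = 0.5 * (lam0 + lam1) * D
    return big_lambda**k * math.exp(-big_lambda) / math.factorial(k)

# Hypothetical values: 0.01 and 0.05 defects per meter on a 100-meter strip.
lam0, lam1, D = 0.01, 0.05, 100.0
print(mean_defects(lam0, lam1, D), 0.5 * (lam0 + lam1) * D)
print([round(prob_k_defects(k, lam0, lam1, D), 4) for k in range(6)])
```

The first line printed compares the numerical integral with the closed form $(\lambda_0 + \lambda_1)D/2$; the cosine term integrates to zero over a full period, which is why only the average of $\lambda_0$ and $\lambda_1$ survives.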








1.11 NORMAL APPROXIMATION TO THE BINOMIAL LAW 


In this section we give, without proof, a numerical approximation to binomial probabilities
and binomial sums. Let $S_k$ denote the event consisting of (exactly) $k$ successes in $n$ Bernoulli
trials. Then the probability of $S_k$ follows a binomial distribution and

$$P[S_k] = \binom{n}{k} p^k q^{n-k} = b(k; n, p), \qquad 0 \le k \le n. \tag{1.11-1}$$

For large values of $n$ and $k$, Equation 1.11-1 may be difficult to evaluate numerically. Also,
the probability of the event $\{k_1 \le \text{number of successes} \le k_2\}$ may involve many terms, making
a direct evaluation of its probability $P[k_1 \le \text{number of successes} \le k_2]$ difficult. Fortunately,
when $n$ is large, we can use approximate methods for evaluating such probabilities. These
approximate methods involve the so-called Normal or Gaussian distribution.

The Normal distribution and its significance will be discussed in greater detail in
Chapter 2 and subsequent chapters in this book. Here we use it only to help evaluate
binomial probabilities. For the present, define the function $f_{SN}(x)$, known as the standard
Normal density, by

$$f_{SN}(x) \triangleq \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}x^2\right), \tag{1.11-2}$$

and its running integral, known as the standard Normal cumulative distribution function,
by

$$F_{SN}(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\tfrac{1}{2}y^2\right) dy. \tag{1.11-3}$$

Then, when $n$ is large it can be shown [1-8] that

$$b(k; n, p) \approx \frac{1}{\sqrt{npq}}\, f_{SN}\!\left(\frac{k - np}{\sqrt{npq}}\right). \tag{1.11-4}$$

The approximation becomes better when $npq \gg 1$. We reproduce the results from
[1-8] in Table 1.11-1. Even in this case, $npq = 1.6$, the approximation is quite good. The
approximation for sums, when $n \gg 1$ and $k_1$ and $k_2$ are fixed integers, takes the form

$$P[k_1 \le \text{number of successes} \le k_2] \approx F_{SN}\!\left[\frac{k_2 - np + 0.5}{\sqrt{npq}}\right] - F_{SN}\!\left[\frac{k_1 - np - 0.5}{\sqrt{npq}}\right]. \tag{1.11-5}$$
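
The pointwise approximation of Equation 1.11-4 is easy to tabulate. The sketch below assumes $n = 10$ and $p = 0.2$; these values are not stated in the text and are chosen only because they give $npq = 1.6$. The function names are likewise hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_approx_pmf(k, n, p):
    """Right-hand side of Equation 1.11-4."""
    sigma = math.sqrt(n * p * (1.0 - p))
    return std_normal_pdf((k - n * p) / sigma) / sigma

n, p = 10, 0.2      # assumed values; they give npq = 1.6
for k in range(7):
    print(k, round(binomial_pmf(k, n, p), 4), round(normal_approx_pmf(k, n, p), 4))
```

Under these assumed parameters the $k = 0$ row evaluates to 0.1074 for the binomial and 0.0904 for the approximation, which matches the pair of values that survives in Table 1.11-1.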
Table 1.11-1 Normal Approximation to the
Binomial for Selected Numbers

k      b(k; n, p)    Normal approximation
…      0.1074        0.0904

















Table 1.11-2 Event Probabilities Using the Normal
Approximation (Adapted from [1-8])

n      p      k1     k2     Binomial sum    Normal approximation
…      0.5    …      105    0.5632          0.5633
…      0.1    50     …      0.3176          0.3235
…      0.3    12     …      0.00015         0.00033
…      0.3    27     …      0.2379          0.2341
…      …      …      …      0.00005         0.00003








Some results, for various values of $n$, $p$, $k_1$, $k_2$, are furnished in Table 1.11-2, which uses the
results in [1-8].
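
A comparison of this kind can be reproduced directly from Equation 1.11-5. In the sketch below the case $n = 200$, $p = 0.5$, $k_1 = 95$, $k_2 = 105$ is an assumption: only $p = 0.5$ and the limit 105 are legible in the first table row, and the other two values are chosen here because they are consistent with it. The function names are hypothetical.

```python
import math

def binomial_sum(k1, k2, n, p):
    """Exact P[k1 <= number of successes <= k2]."""
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(k1, k2 + 1))

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_approx_sum(k1, k2, n, p):
    """Right-hand side of Equation 1.11-5, including the +/- 0.5 corrections."""
    mu, sigma = n * p, math.sqrt(n * p * (1.0 - p))
    return std_normal_cdf((k2 - mu + 0.5) / sigma) - std_normal_cdf((k1 - mu - 0.5) / sigma)

# Assumed case: n = 200 fair-coin trials, between 95 and 105 successes.
n, p, k1, k2 = 200, 0.5, 95, 105
print(round(binomial_sum(k1, k2, n, p), 4), round(normal_approx_sum(k1, k2, n, p), 4))
```

The $\pm 0.5$ terms are what make the approximation usable for sums over integers: each integer $k$ is treated as the unit-width interval $(k - 0.5, k + 0.5)$ under the Normal curve.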

In using the Normal approximation, one should refer to Table 2.4-1. In Table 2.4-1 a
function called erf(x) is given rather than $F_{SN}(x)$. The erf(x) is defined by

$$\operatorname{erf}(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_0^x \exp\left(-\tfrac{1}{2}t^2\right) dt.$$

However, since it is easy to show that

$$F_{SN}(x) = \tfrac{1}{2} + \operatorname{erf}(x), \qquad x > 0, \tag{1.11-6}$$

$$F_{SN}(x) = \tfrac{1}{2} - \operatorname{erf}(|x|), \qquad x < 0, \tag{1.11-7}$$

we can compute Equation 1.11-5 in terms of the table values. Thus, with $a' \triangleq \frac{k_1 - np - 0.5}{\sqrt{npq}}$
and $b' \triangleq \frac{k_2 - np + 0.5}{\sqrt{npq}}$ and $b' > a'$, we can use the results in Table 1.11-3.
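
When a software library is used instead of a table, some care is needed: Equations 1.11-6 and 1.11-7 require an erf that equals $F_{SN}(x) - \tfrac{1}{2}$ for $x > 0$, which is not the error function most libraries provide. Python's `math.erf` implements the common definition $(2/\sqrt{\pi})\int_0^x e^{-t^2}\,dt$, and a change of variable relates the two. The sketch below shows the conversion; the names `book_erf` and `std_normal_cdf` are hypothetical.

```python
import math

def book_erf(x):
    """erf in this section's sense: (1/sqrt(2*pi)) * integral_0^x exp(-t^2/2) dt."""
    return 0.5 * math.erf(x / math.sqrt(2.0))

def std_normal_cdf(x):
    """F_SN(x) assembled from Equations 1.11-6 and 1.11-7."""
    return 0.5 + book_erf(x) if x >= 0 else 0.5 - book_erf(abs(x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, round(std_normal_cdf(x), 4))
```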

The Normal approximation is also useful in evaluating Poisson sums. For example, a
sum such as in Equation 1.10-9 is tedious to evaluate if done directly. However, if $\lambda\tau \gg 1$,
we can use the Normal approximation to the Poisson law, which is merely an extension of the
Normal approximation to the binomial law. This extension is expected since we have seen
that the Poisson law is itself an approximation to the binomial law under certain circumstances.
From the results given above we are able to justify the following approximation.

$$\sum_{k=\alpha}^{\beta} e^{-\lambda\tau}\,\frac{[\lambda\tau]^k}{k!} \approx \frac{1}{\sqrt{2\pi}} \int_{l_1}^{l_2} \exp\left(-\tfrac{1}{2}y^2\right) dy, \tag{1.11-8}$$

where

$$l_2 \triangleq \frac{\beta - \lambda\tau + 0.5}{\sqrt{\lambda\tau}}$$

and

$$l_1 \triangleq \frac{\alpha - \lambda\tau - 0.5}{\sqrt{\lambda\tau}}.$$

Another useful approximation is

$$e^{-\lambda\tau}\,\frac{[\lambda\tau]^k}{k!} \approx \frac{1}{\sqrt{2\pi}} \int_{l_3}^{l_4} \exp\left(-\tfrac{1}{2}y^2\right) dy, \tag{1.11-9}$$

where

$$l_4 \triangleq \frac{k - \lambda\tau + 0.5}{\sqrt{\lambda\tau}}$$

and

$$l_3 \triangleq \frac{k - \lambda\tau - 0.5}{\sqrt{\lambda\tau}}.$$

For example, with $\lambda\tau = 5$, and $k = 5$, the error in using the Normal approximation of
Equation 1.11-9 is less than 1 percent.
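
The closing claim can be verified with a short calculation. The sketch below evaluates both sides of Equation 1.11-9 for $\lambda\tau = 5$ and $k = 5$; the function names are hypothetical, and the Normal integral is computed through `math.erf` rather than a table.

```python
import math

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean**k / math.factorial(k)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_approx_poisson_sum(alpha, beta, mean):
    """Right-hand side of Equation 1.11-8; alpha == beta gives Equation 1.11-9."""
    root = math.sqrt(mean)
    l_hi = (beta - mean + 0.5) / root
    l_lo = (alpha - mean - 0.5) / root
    return std_normal_cdf(l_hi) - std_normal_cdf(l_lo)

mean, k = 5.0, 5
exact = poisson_pmf(k, mean)
approx = normal_approx_poisson_sum(k, k, mean)
print(exact, approx, abs(approx - exact) / exact)
```

The last printed number is the relative error, which should fall a little below 0.01 for this case.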


SUMMARY 


In this, the first chapter of the book, we have reviewed some different definitions of probability.
We developed the axiomatic theory and showed that for a random experiment three
important objects were required: the sample space $\Omega$, the sigma field of events $\mathcal{F}$, and a
probability measure $P$. The mathematical triple $(\Omega, \mathcal{F}, P)$ is called the probability space $\mathcal{P}$.

We introduced the important notions of independent, dependent, and compound events, 
and conditional probability. We developed a number of relations to enable the application 
of these. 
We discussed some important formulas from combinatorics and briefly illustrated how
important they were in theoretical physics. We then discussed the binomial probability law 
and its generalization, the multinomial law. We saw that the binomial law could, when 
certain limiting conditions were valid, be approximated by the Poisson law. The Poisson 
law, one of the central laws in probability theory, was shown to have application in numerous 
branches of science and engineering. We stated, but deferred verification until Chapter 9, 
that the Poisson law can be derived directly from simple and entirely reasonable physical 
assumptions. 

Approximations for the binomial and Poisson laws, based on the Normal distribu- 
tion, were furnished. Several occupancy problems of engineering interest were discussed. 
In Chapter 4 we shall revisit these problems. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 


1.1 In order for a statement such as “Ralph is probably guilty of theft” to have meaning 
in the relative frequency approach to probability, what kind of data would one need? 

1.2 Problems in applied probability (a branch of mathematics called statistics) often 
involve testing $P \to Q$ ($P$ implies $Q$) type statements, for example, if she smokes,
she will probably get sick; if he is smart he will do well in school. You are given a 
set of four cards that have a letter on one side and a number on the other. You are 
asked to test the rule “If a card has a D on one side, it has a three on the other.” 
Which of the following cards should you turn over to test the veracity of the rule: 


Card 1 Card 2 Card 3 Card 4 


Be careful here! 

1.3 In a spinning-wheel game, the spinning wheel contains the numbers 1 to 9. The 
contestant wins if an odd number shows. What is the probability of a win? What 
are your assumptions? 

1.4 A fair coin is flipped three times. The outcomes on each flip are heads H or tails T. 
What is the probability of obtaining two tails and one head? 

1.5 An urn contains three balls numbered 1, 2, 3. The experiment consists of drawing a 
ball at random, recording the number, and replacing the ball before the next ball is 
drawn. This is called sampling with replacement. What is the probability of drawing 
the same ball thrice in three tries? 

1.6 An experiment consists of drawing two balls without replacement from an urn 
containing five balls numbered 1 to 5. Describe the sample space $\Omega$. What is $\Omega$
if the ball is replaced before the second is drawn? 
1.7 The experiment consists of measuring the heights of each partner of a randomly
chosen married couple. (a) Describe $\Omega$ in convenient notation; (b) let $E$ be the event
that the man is shorter than the woman. Describe $E$ in convenient notation.

1.8 An urn contains ten balls numbered 1 to 10. Let $E$ be the event of drawing a ball
numbered no greater than 6. Let $F$ be the event of drawing a ball numbered greater
than 3 but less than 9. Evaluate $E^c$, $F^c$, $EF$, $E \cup F$, $EF^c$, $E^c \cup F^c$, $EF^c \cup E^cF$,
$EF \cup E^cF^c$, $(E \cup F)^c$, and $(EF)^c$. Express these events in words.

1.9 There are four equally likely outcomes $\zeta_1, \zeta_2, \zeta_3$, and $\zeta_4$ and two events $A = \{\ldots\}$
and $B = \{\ldots\}$. Express the sets (events) $AB^c$, $BA^c$, $AB$, and $A \cup B$ in terms of
their elements (outcomes).

1.10 Verify the useful set identities $A = AB \cup AB^c$ and $A \cup B = (AB^c) \cup (BA^c) \cup (AB)$.
Does probability add over these unions? Why?

1.11 In a given random experiment there are four equally likely outcomes $\zeta_1, \zeta_2, \zeta_3$, and $\zeta_4$.
Let the event $A \triangleq \{\zeta_1, \zeta_2\}$. What is the probability of $A$? What is the event (set) $A^c$
in terms of the outcomes? What is the probability of $A^c$? Verify that $P[A] = 1 - P[A^c]$
here.

1.12 Consider the probability space $(\Omega, \mathcal{F}, P)$ for this problem.

(a) State the three axioms of probability theory and explain in a sentence the
significance of each.

(b) Derive the following formula, justifying each step by reference to the appropriate
axiom above,

$$P[A \cup B] = P[A] + P[B] - P[A \cap B],$$

where $A$ and $B$ are arbitrary events in the field $\mathcal{F}$.

1.13 An experiment consists of drawing two balls at random, with replacement from
an urn containing five balls numbered 1 to 5. Three students “Dim,” “Dense,” and
“Smart” were asked to compute the probability $p$ that the sum of numbers appearing
on the two draws equals 5. Dim computed $p = \frac{2}{15}$, arguing that there are 15 distinguishable
unordered pairs and only 2 are favorable, that is, (1,4) and (2,3). Dense
computed $p = \frac{1}{9}$, arguing that there are 9 distinguishable sums (2 to 10), of which
only 1 was favorable. Smart computed $p = \frac{4}{25}$, arguing that there were 25 distinguishable
ordered outcomes of which 4 were favorable, that is, (4,1), (3,2), (2,3),
and (1,4). Why is $p = \frac{4}{25}$ the correct answer? Explain what is wrong with the
reasoning of Dense and Dim.

1.14 Prove the distributive law for set union, that is,

$$A \cup (B \cap C) = (A \cup B) \cap (A \cup C),$$

by showing that each side is contained in the other.

1.15 Prove the general result $P[A] = 1 - P[A^c]$ for any probability experiment and any
event $A$ defined on this experiment.

1.16 Let $\Omega = \{1, 2, 3, 4, 5, 6\}$. Define three events: $A = \{1, 2\}$, $B = \{2, 3\}$, and $C =
\{4, 5, 6\}$. The probability measure is unknown, but it satisfies the three axioms.
(a) What is the probability of $A \cap C$?

(b) What is the probability of $A \cup B \cup C$?

(c) State a condition on the probability of either $B$ or $C$ that would allow them
to be independent events.

1.17 Use the axioms given in Equations 1.5-1 to 1.5-3 to show the following: ($E \in \mathcal{F}$,
$F \in \mathcal{F}$) (a) $P[\phi] = 0$; (b) $P[EF^c] = P[E] - P[EF]$; (c) $P[E] = 1 - P[E^c]$.

1.18 Use the probability space $(\Omega, \mathcal{F}, P)$ for this problem. What is the difference between
an outcome, an event, and a field of events?

1.19 Use the axioms of probability to show the following: ($A \in \mathcal{F}$, $B \in \mathcal{F}$): $P[A \cup B] =
P[A] + P[B] - P[A \cap B]$, where $P$ is the probability measure on the sample space
$\Omega$, and $\mathcal{F}$ is the field of events.

1.20 Use the “exclusive-or” operator in Equation 1.4-3 to show that $P[E \oplus F] = P[EF^c] +
P[E^cF]$.

1.21 Show that $P[E \oplus F]$ in the previous problem can be written as $P[E \oplus F] = P[E] +
P[F] - 2P[EF]$.

*1.22 Let the sample space $\Omega$ = {cat, dog, goat, pig}.

(a) Assume that only the following probability information is given:

P[{cat, dog}] = 0.9,
P[{goat, pig}] = 0.1,
P[{pig}] = 0.05,
P[{dog}] = 0.5.

For this given set of probabilities, find the appropriate field of events $\mathcal{F}$
so that the overall probability space $(\Omega, \mathcal{F}, P)$ is well defined. Specify the
field $\mathcal{F}$ by listing all the events in the field, along with their corresponding
probabilities.

(b) Repeat part (a), but without the information that P[{pig}] = 0.05.

1.23 Prove the distributive law for set intersection, that is,

$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C),$$

by showing that each side is contained in the other.

1.24 The probability that a communication system will have high fidelity is 0.81, and the
probability that it will have high fidelity and high selectivity is 0.18. What is the
probability that a system with high fidelity will also have high selectivity?

1.25 An urn contains eight balls. The letters a and b are used to label the balls. Two balls
are labeled a, two are labeled b, and the remaining balls are labeled with both letters,
that is a,b. Except for the labels, all the balls are identical. Now a ball is drawn at
random from the urn. Let $A$ and $B$ represent the events of observing letters a and
b, respectively. Find $P[A]$, $P[B]$, and $P[AB]$. Are $A$ and $B$ independent? (Note that
you will observe the letter a when you draw either an a ball or an a,b ball.)
1.26 A fair die is tossed twice (a die is said to be fair if all outcomes 1,...,6 are equally
likely). Given that a 3 appears on the first toss, what is the probability of obtaining
the sum 7 after the second toss?

1.27 In the experiment of throwing two fair dice, $A$ is the event that the number on
the first die is odd, $B$ the event that the number on the second die is odd, and $C$
the event that the sum of the faces is odd. Show that $A$, $B$ and $C$ are pairwise
independent, but $A$, $B$ and $C$ are not independent.

1.28 Two numbers are chosen at random from the numbers 1 to 10 without replacement.
Find the probability that the second number chosen is 5.

1.29 A random-number generator generates integers from 1 to 9 (inclusive). All outcomes
are equally likely; each integer is generated independently of any previous integer.
Let $\Sigma$ denote the sum of two consecutively generated integers; that is, $\Sigma = N_1 + N_2$.
Given that $\Sigma$ is odd, what is the conditional probability that $\Sigma$ is 7? Given that
$\Sigma > 10$, what is the conditional probability that at least one of the integers is > 7?
Given that $N_1 > 8$, what is the conditional probability that $\Sigma$ will be odd?

1.30 Two firms, V and W, consider bidding on a road-building job which may or may
not be awarded depending on the amount of the bids. Firm V submits a bid and the
probability is 3/4 that V will get the job, provided firm W does not bid. The odds
are 3 to 1 that W will bid and if it does, the probability that V will get the job is
only 1/3.

(a) What is the probability that V will get the job?

(b) If V gets the job, what is the probability that W did not bid?

1.31 Henrietta is 29 years old and physically very fit. In college she majored in geology.
During her student days, she frequently hiked in the national forests and biked in the
national parks. She participated in anti-logging and anti-mining operations. Now,
Henrietta works in an office building in downtown Nirvana. Which is greater: the
probability that Henrietta’s occupation is that of office manager; or the probability
that Henrietta is an office manager who is active in nature-defense organizations like
the Sierra Club?

1.32 In the ternary communication channel shown in Figure P1.32 a 3 is sent three times
more frequently than a 1, and a 2 is sent two times more frequently than a 1. A 1 is
observed; what is the conditional probability that a 1 was sent?





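[Figure not reproduced: channel diagram with inputs X = 1, 2, 3 and outputs Y = 1, 2, 3, labeled with the transition probabilities P[Y = 2 | X = 2] and P[Y = 3 | X = 3].]

Figure P1.32 Ternary communication channel.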
1.33 A large class in probability theory is taking a multiple-choice test. For a particular
question on the test, the fraction of examinees who know the answer is $p$; $1 - p$ is the
fraction that will guess. The probability of answering a question correctly is unity
for an examinee who knows the answer and $1/m$ for a guessee; $m$ is the number of
multiple-choice alternatives. Compute the probability that an examinee knew the
answer to a question given that he or she has correctly answered it.

1.34 In the beauty-contest problem, Example 1.6-12, what is the probability of picking
the most beautiful contestant if we decide a priori to choose the $i$th ($1 \le i \le N$)
contestant?

1.35 Assume there are three machines A, B, and C in a semiconductor manufacturing
facility that make chips. They manufacture, respectively, 25, 35, and 40 percent
of the total semiconductor chips there. Of their outputs, respectively, 6, 4, and 2
percent of the chips are defective. A chip is drawn randomly from the combined
output of the three machines and is found defective. What is the probability that
this defective chip was manufactured by machine A? by machine B? by machine C?

1.36 In Example 1.6-12, plot the probability of making a correct decision versus $\alpha/N$,
assuming that the “wait-and-see” strategy is adopted. In particular, what is $P[D]$
when $\alpha/N = 0.5$? What does this suggest about the sensitivity of $P[D]$ vis-a-vis $\alpha$
when $\alpha$ is not too far from $\alpha_0$ and $N$ is large?

1.37 In the village of Madre de la Paz in San Origami, a great flood displaces 103 villagers.
The government builds a temporary tent village of 30 tents and assigns the 103
villagers randomly to the 30 tents.

(a) Identify this problem as an occupancy problem. What are the analogues to
the balls and cells?

(b) How many distinguishable distributions of people in tents can be made?

(c) How many distinguishable distributions are there in which no tent remains
empty?

1.38 Consider $r$ indistinguishable balls (particles) and $n$ cells (states) where $n > r$. The
$r$ balls are placed at random into the $n$ cells (multiple occupancy is possible). What
is the probability $P$ that the $r$ balls appear in $r$ preselected cells (one to a cell)?

1.39 A committee of 5 people is to be selected randomly from a group of 5 men and 10
women. Find the probability that the committee consists of (a) 2 men and 3 women,
and (b) only women.

1.40 Three tribal elders win elections to lead the unstable region of North Vatisthisstan.
Five identical assault rifles, a gift of the people of Sodabia, are airdropped among
a meeting of the three leaders. The tribal leaders scamper to collect as many of the
rifles as they each can carry, which is five.

(a) Identify this as an occupancy problem.

(b) List all possible distinguishable distributions of rifles among the three tribal
leaders.

(c) How many distinguishable distributions are there where at least one of the
tribal leaders fails to collect any rifles?

(d) What is the probability that all tribal leaders collect at least one rifle?

(e) What is the probability that exactly one tribal leader will not collect any
rifles?


1.41 In some casinos there is the game Sic bo, in which bettors bet on the outcome of 
a throw of three dice. Many bets are possible each with a different payoff. We list 
some of them below with the associated payoffs in parentheses: 


(a) Specified three of a kind (180 to 1);
(b) Unspecified three of a kind (30 to 1);
(c) Specified two of a kind (10 to 1);
(d) Sum of three dice equals 4 or 17 (60 to 1);
(e) Sum of three dice equals 5 or 16 (30 to 1);
(f) Sum of three dice equals 6 or 15 (17 to 1);
(g) Sum of three dice equals 7 or 14 (12 to 1);
(h) Sum of three dice equals 8 or 13 (8 to 1);
(i) Sum of three dice equals 9, 10, 11, 12 (6 to 1);
(j) Specified two dice combination; that is, of the three dice displayed, two of
them must match exactly the combination wagered (5 to 1).


We wish to compute the associated probabilities of winning from the player’s point 
of view and his expected gain. 

1.42 Most communication networks use packet switching to create virtual circuits between 
two users, even though the users are sharing the same physical channel with others. 
In packet switching, the data stream is broken up into packets that travel different 
paths and are reassembled in the proper chronological order and at the correct 
address. Suppose the order information is missing. Compute the probability that a 
data stream broken up into N packets will reassemble itself correctly, even without 
the order information. 

1.43 In the previous problem assume that $N = 4$. A lazy engineer decides to omit the order
information in favor of repeatedly sending the data stream until the packets re-order
correctly for the first time. Derive a formula for the probability that the correct
re-ordering occurs for the first time on the $n$th try. How many repetitions should be
allowed before the cumulative probability of a correct re-ordering for the first time is
at least 0.95?

1.44 Prove that the binomial law $b(k; n, p)$ is a valid probability assignment by showing
that $\sum_{k=0}^{n} b(k; n, p) = 1$.

1.45 War-game strategists make a living by solving problems of the following type. There 
are 6 incoming ballistic missiles (BMs) against which are fired 12 antimissile missiles 
(AMMs). The AMMs are fired so that two AMMs are directed against each BM. 
The single-shot-kill probability (SSKP) of an AMM is 0.8. The SSKP is simply the 
probability that an AMM destroys a BM. Assume that the AMM’s don’t interfere 
with each other and that an AMM can, at most, destroy only the BM against 
which it is fired. Compute the probability that (a) all BMs are destroyed, (b) at 
least one BM gets through to destroy the target, and (c) exactly one BM gets 
through. 

1.46 Assume in the previous problem that the target was destroyed by the BMs. What 
is the conditional probability that only one BM got through? 
1.47 A computer chip manufacturer finds that, historically, for every 100 chips produced,
80 meet specifications, 15 need reworking, and 5 need to be discarded. Ten chips are
chosen for inspection.

(a) What is the probability that all 10 meet specs?

(b) What is the probability that 2 or more need to be discarded?

(c) What is the probability that 8 meet specs, 1 needs reworking, and 1 will be
discarded?

1.48 Unlike the city of Nirvana, New York, where 911 is the all-purpose telephone number
for emergencies, in Moscow, Russia, you dial 01 for a fire emergency, 02 for the police,
and 03 for an ambulance. It is estimated that emergency calls in Russia have the
same frequency distribution as in Nirvana, namely, 60 percent are for the police,
25 percent are for ambulance service, and 15 percent are for the fire department.
Assume that 10 calls are monitored and that none of the calls overlap in time and
that the calls constitute independent trials.

1.49 A smuggler, trying to pass himself off as a glass-bead importer, attempts to smuggle
diamonds by mixing diamond beads among glass beads in the proportion of one
diamond bead per 2000 beads. A harried customs inspector examines a sample of
100 beads. What is the probability that the smuggler will be caught, that is, that
there will be at least one diamond bead in the sample?

1.50 Assume that a faulty receiver produces audible clicks to the great annoyance of the
listener. The average number of clicks per second depends on the receiver temperature
and is given by $\lambda(\tau) = 1 - e^{-\tau/10}$, where $\tau$ is time from turn-on. Evaluate the
formula for the probability of 0, 1, 2, ... clicks during the first 5 seconds of operation
after turn-on. Assume the Poisson law.

1.51 A frequently held lottery sells 100 tickets at $1 per ticket every time it is held. One
of the tickets must be a winner. A player has $50 to spend. To maximize the probability
of winning at least one lottery, should he buy 50 tickets in one lottery or one
ticket in 50 lotteries?

1.52 In the previous problem, which of the two strategies will lead to a greater expected
gain for the player? The expected gain if $M$ ($M \le 50$) lotteries are played is defined
as $G_M \triangleq \sum_i G_i P(i)$, where $G_i$ is the gain obtained in winning $i$ lotteries.

1.53 The switch network shown in Figure P1.53 represents a digital communication link.
Switches $a_i$, $i = 1, \ldots, 6$, are open or closed and operate independently. The probability
that a switch is closed is $p$. Let $A_i$ represent the event that switch $i$ is closed.


[Figure not reproduced: network of six switches a1 through a6 connecting terminal 1 to terminal 2.]

Figure P1.53 Switches in telephone link.
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(a) In terms of the A,’s write the event that there exists at least one closed path 
from 1 to 2. 

(b) Compute the probability of there being at least one closed path from 1 
to 2. 


(independence of events in disjoint intervals for Poisson law) The average number 
of cars arriving at a tollbooth per minute is \ and the probability of k cars in the 
interval (0, T) minutes is 


AT]* 
P(k;0,T) = ear ATIE 
k! 
Consider two disjoint, that is, nonoverlapping, intervals, say (0, tı] and (t1, T]. Then 
for the Poisson law: 


P{n, cars in (0, tı] and ng cars in (tı, T]] (1.11-10) 
= P[n, cars in (0,t1]]P[ne cars in (t1, T]), (1.11-11) 


that is events in disjoint intervals are independent. Using this fact, show the following: 


(a) That P[n, cars in (0, 1]|n1 +2 cars in (0, T]] is not a function of À. 
(b) In (a) let T = 2, ti = 1, m = 5, and nz = 5. Compute P[5 cars in 
(0, 1]|10 cars in (0, 2]. 

An automatic breathing apparatus (B) used in anesthesia fails with probability Pg. 
A failure means death to the patient unless a monitor system (M) detects the failure 
and alerts the physician. The monitor system fails with probability Pm. The fail- 
ures of the system components are independent events. Professor X, an M.D. at 
Hevardi Medical School, argues that if Pu > Ps installation of M is useless.! 
Show that Prof. X needs to take a course on probability theory by computing the 
probability of a patient dying with and without the monitor system in place. Take 
Py = 0.1 = 2Pg. 
In a particular communication network, the server broadcasts a packet of data 
(say, L bytes long) to N receivers. The server then waits to receive an acknowl- 
edgment message from each of the N receivers before proceeding to broadcast the 
next packet. If the server does not receive all the acknowledgments within a certain 
time period, it will rebroadcast (retransmit) the same packet. The server is then 
said to be in the “retransmission mode.” It will continue retransmitting the packet 
until all N acknowledgments are received. Then it will proceed to broadcast the 
next packet. Let $p \triangleq P$[successful transmission of a single packet to a single receiver 
along with successful acknowledgment]. Assume that these events are independent 
for different receivers or separate transmission attempts. Due to random impairments 
in the transmission media and the variable condition of the receivers, we 
have that $p < 1$. 


†A true story! The name of the medical school has been changed. 

(a) In a fixed protocol or method of operation, we require that all N of the 
acknowledgments be received in response to a given transmission attempt for 
that packet transmission to be declared successful. Let the event $S(m)$ be 
defined as follows: $S(m) \triangleq$ {a successful transmission of one packet to all N 
receivers in m or fewer attempts}. Find the probability 

$$P(m) \triangleq P[S(m)].$$


[Hint: Consider the complement of the event S(m).] 

(b) An improved system operates according to a dynamic protocol as follows. 
Here we relax the acknowledgment requirement on retransmission attempts, 
so as to only require acknowledgments from those receivers that have not yet 
been heard from on previous attempts to transmit the current packet. Let 
$S_D(m)$ be the same event as in part (a) but using the dynamic protocol. Find 
the probability 

$$P_D(m) \triangleq P[S_D(m)].$$

[Hint: First consider the probability of the event $S_D(m)$ for an individual 
receiver, and then generalize to the N receivers.] 

Note: If you try p = 0.9 and N = 5 you should find that $P(2) < P_D(2)$. 
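
The note above can be checked empirically before attempting the closed-form answers. The following is a minimal Monte Carlo sketch, not part of the original problem: the function names, the trial count, and the seed are illustrative assumptions, and the code simply plays out the two protocols as described in the problem statement.

```python
import random

def simulate_fixed(p, n_receivers, m, rng):
    """Fixed protocol: an attempt counts only if all receivers acknowledge it."""
    for _ in range(m):
        if all(rng.random() < p for _ in range(n_receivers)):
            return True
    return False

def simulate_dynamic(p, n_receivers, m, rng):
    """Dynamic protocol: receivers heard from on earlier attempts stay acknowledged."""
    pending = n_receivers
    for _ in range(m):
        # Count the pending receivers that fail again on this attempt.
        pending = sum(1 for _ in range(pending) if rng.random() >= p)
        if pending == 0:
            return True
    return False

def estimate(sim, p, n_receivers, m, trials=20000, seed=1):
    """Relative frequency of success in m or fewer attempts (seed is arbitrary)."""
    rng = random.Random(seed)
    return sum(sim(p, n_receivers, m, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("fixed   protocol, m = 2:", estimate(simulate_fixed, 0.9, 5, 2))
    print("dynamic protocol, m = 2:", estimate(simulate_dynamic, 0.9, 5, 2))
```

With p = 0.9 and N = 5 the dynamic estimate should come out noticeably larger than the fixed one, which is a useful sanity check on whatever formulas you derive for parts (a) and (b).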

Toss two unbiased dice (each with six faces: 1 to 6), and write down the sum of 
the two face numbers. Repeat this procedure 100 times. What is the probability of 
getting 10 readings of value 7? What is the Poisson approximation for computing 
this probability? (Hint: Consider the event A = {sum = 7} on a single toss and let 
p in Equation 1.9-1 be P[A].) 

On behalf of your tenants you have to provide a laundry facility. Your choices 
are 


1. lease two inexpensive “Clogger” machines at $50.00/month each; or 
2. lease a single “NeverFail” at $100/month. 


The Clogger is out of commission 40 percent of the time while the NeverFail is out 
of commission only 20 percent of the time. 


(a) From the tenants' point of view, which is the better alternative? 
(b) From your point of view as landlord, which is the better alternative? 


In the politically unstable country of Eastern Borduria, it is not uncommon to find 
a bomb onboard passenger aircraft. The probability that on any given flight, a bomb 
will be onboard is $10^{-2}$. A nervous passenger always flies with an unarmed bomb 
in his suitcase, reasoning that the probability of there being two bombs onboard is 
$10^{-4}$. By this maneuver, the nervous passenger believes that he has greatly reduced 
the airplane’s chances of being blown up. Do you agree with his reasoning? If not, 
why not? 

In a ring network consisting of eight links as shown in Figure P1.60, there are 
two paths connecting any two terminals. Assume that links fail independently with 
probability q, 0 < q < 1. Find the probability of successful transmission of a packet 
from terminal A to terminal B. (Note: Terminal A transmits the packet in both 
directions on the ring. Also, terminal B removes the packet from the ring upon 
reception. Successful transmission means that terminal B received the packet from 
either direction.) 


Figure P1.60 A ring network with eight stations. 


A union directive to the executives of the telephone company demands that tele- 
phone operators receive overtime payment if they handle more than 5680 calls in an 
eight-hour day. What is the probability that Curtis, a unionized telephone operator, 
will collect overtime on a particular day where the occurrence of calls during the 
eight-hour day follows the Poisson law with rate parameter $\lambda = 710$ calls/hour? 

Toss two unbiased coins (each with two sides: numbered 1 and 2), and write down 
the sum of the two side numbers. Repeat this procedure 80 times. What is the prob- 
ability of getting 10 readings of value 2? What is the Poisson approximation for 
computing this probability? 

The average number of cars arriving at a tollbooth is $\lambda$ cars per minute, and the 
probability of cars arriving is assumed to follow the Poisson law. Given that 6 cars 
arrive in the first three minutes, what is the probability of 12 cars arriving in the 
first six minutes? 

An aging professor, desperate to finally get a good review for his course on proba- 
bility, hands out chocolates to his students. The professor’s short-term memory is 
so bad that he can’t remember which students have already received a chocolate. 
Assume that, for all intents and purposes, the chocolates are distributed randomly. 
There are 10 students and 15 chocolates. What is the probability that each student 
received at least one chocolate? 

Assume that code errors in a computer program occur as follows: A line of code 
contains errors with probability p = 0.001 and is error free with probability q = 
0.999. Also errors in different lines occur independently. In a 1000-line program, 
what is the approximate probability of finding 2 or more erroneous lines? 

Let us assume that two people have their birthdays on the same day if both the 
month and the day are the same for each (not necessarily the year). How many 
people would you need to have in a room before the probability is 1/2 or greater that 
at least two people have their birthdays on the same day? 
(sampling) We draw ten chips at random from a semiconductor manufacturing line 
that is known to have a defect rate of 2 percent. Find the probability that more 
than one of the chips in our sample is defective. 

(percolating fractals) Consider a square lattice with $N^2$ cells, that is, N cells per side. 
Write a program that does the following: With probability p you put an electrically 
conducting element in a cell and with probability $q = 1 - p$, you leave the cell empty. 
Do this for every cell in the lattice. When you are done, does there exist a continuous 
path for current to flow from the bottom of the lattice to the top? If yes, the lattice 
is said to percolate. Percolation models are used in the study of epidemics, spread of 
forest fires, and ad hoc networks, etc. The lattice is called a random fractal because 
of certain invariant properties that it possesses. Try N = 10, 20, 50; p = 0.1, 0.3, 0.6. 
You will need a random number generator. MATLAB has the function rand, which 
generates uniformly distributed random numbers $x_i$ in the interval (0.0, 1.0). If the 
number $x_i \le p$, make the cell electrically conducting; otherwise leave it alone. Repeat 
the procedure as often as time permits in order to estimate the probability of perco- 
lation for different p’s. A nonpercolating lattice is shown in Figure P1.68(a); a perco- 
lating lattice is shown in (b). For more discussion of this problem, see M. Schroeder, 
Fractals, Chaos, Power Laws (New York: W.H. Freeman, 1991). 
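
One possible starting point for the program is sketched below in Python rather than MATLAB. It is only a sketch under stated assumptions: current is assumed to flow only between cells that share an edge (four-neighbour connectivity), and the function names, trial count, and seed are illustrative choices that do not come from the problem statement.

```python
import random
from collections import deque

def random_lattice(n, p, rng):
    """Return an n-by-n grid; True marks a conducting cell (probability p)."""
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]

def percolates(grid):
    """Breadth-first search from conducting bottom-row cells toward the top row."""
    n = len(grid)
    bottom, top = n - 1, 0
    queue = deque((bottom, c) for c in range(n) if grid[bottom][c])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if r == top:
            return True
        # Assumption: only edge-sharing neighbours conduct to each other.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n and grid[rr][cc] and (rr, cc) not in seen:
                seen.add((rr, cc))
                queue.append((rr, cc))
    return False

def estimate_percolation(n, p, trials=200, seed=0):
    """Fraction of randomly generated lattices that percolate."""
    rng = random.Random(seed)
    hits = sum(percolates(random_lattice(n, p, rng)) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (10, 20, 50):
        for p in (0.1, 0.3, 0.6):
            print(n, p, estimate_percolation(n, p))
```

Increasing `trials` tightens the estimate at the cost of run time. Allowing diagonal contact would change the neighbour list and, in general, the estimated probabilities.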

You are a contestant on a TV game show. There are three identical closed doors 
leading to three rooms. Two of the rooms contain nothing, but the third contains 
a $100,000 Rexus luxury automobile which is yours if you pick the right door. You 
are asked to pick a door by the master of ceremonies (MC) who knows which room 
contains the Rexus. After you pick a door, the MC opens a door (not the one you 
picked) to show a room not containing the Rexus. Show that even without any 
further knowledge, you will greatly increase your chances of winning the Rexus if 
you switch your choice from the door you originally picked to the one remaining 
closed door. 

Often we are faced with determining the more likely of two alternatives. In such a 
case we are given two probability measures for a single sample space and field of 
events, that is, $(\Omega, \mathcal{F}, P_1)$ and $(\Omega, \mathcal{F}, P_2)$, and we are asked to determine the 
probability of an observed event $E$ in both cases. The more likely alternative is said to 
be the one which gives the higher probability of event $E$. 

Consider that two coins are in a box; one is "fair" with $P_1[\{H\}] = 0.5$ and one is 
"biased" with $P_2[\{H\}] = p$. Without looking, we draw one coin from the box and 
then flip this single coin ten times. We only consider the repeated coin-flips as our 
experiment and so the sample space $\Omega$ = {all ten-character strings of H and T}. 
We observe the event $E$ = {a total of four H's and six T's}. 


(a) What are the two probabilities of the observed event $E$, that is, $P_1[E]$ and 
$P_2[E]$? 

(b) Determine the likelihood ratio $L \triangleq P_1[E]/P_2[E]$ as a function of $p$. (When 
$L > 1$, we say that the fair coin is more likely. This test is called a likelihood 
ratio test.) 
2 Random Variables 


2.1 INTRODUCTION 


Many random phenomena have outcomes that are sets of real numbers: the voltage v(t), 
at time t, across a noisy resistor, the arrival time of the next customer at a movie theatre, 
the number of photons in a light pulse, the brightness level at a particular point on the TV 
screen, the number of times a light bulb will switch on before failing, the lifetime of a given 
living person, the number of people on a New York to Chicago train, and so forth. In all 
these cases the sample spaces are sets of numbers on the real line. 

Even when a sample space $\Omega$ is not numerical, we might want to generate a new sample 
space from $\Omega$ that is numerical, that is, converting random speech, color, gray tone, and so 
forth to numbers, or converting the physical fitness profile of a person chosen at random 
into a numerical “fitness” vector consisting of weight, height, blood pressure, heart rate, 
and so on, or describing the condition of a patient afflicted with, say, black lung disease by 
a vector whose components are the number and size of lung lesions and the number of lung 
zones affected. 

In science and engineering, we are in almost all instances interested in numerical 
outcomes, whether the underlying experiment $\mathcal{H}$ is numerical-valued or not. To obtain 
numerical outcomes, we need a rule or mapping from the original sample space $\Omega$ to the 
real line $R^1$. Such a mapping is what a random variable fundamentally is and we discuss it 
in some detail in the next several sections. 

Let us, however, make a remark or two. The concept of a random variable will enable 
us to replace the original probability space with one in which events are sets of numbers. 
Thus, on the induced probability space of a random variable every event is a subset of $R^1$. 
But is every subset of $R^1$ always an event? Are there subsets of $R^1$ that could get us into 
trouble via violating the axioms of probability? The answer is yes, but fortunately these 
subsets are not of engineering or scientific importance. We say that they are nonmeasurable.† 
Sets of practical importance are of the form $\{x = a\}$, $\{x: a \le x \le b\}$, $\{x: a < x \le b\}$, 
$\{x: a \le x < b\}$, $\{x: a < x < b\}$, and their unions and intersections. These five intervals are 
more easily denoted $[a]$, $[a, b]$, $(a, b]$, $[a, b)$, and $(a, b)$. Intervals that include the end points 
are said to be closed; those that leave out end points are said to be open. Intervals can also 
be half-closed (half-open); for example, the interval $(a, b]$ is open on the left and closed 
on the right. The field of subsets of $R^1$ generated by the intervals was called the Borel field 
in Chapter 1, Section 4. 

We can define more than one random variable on the same underlying sample space 
$\Omega$. For example, suppose that $\Omega$ consists of a large, representational group of people in the 
United States. Let the experiment consist of choosing a person at random. Let X denote 
the person’s lifetime and Y denote that person’s daily consumption of cigarettes. We can 
now ask: Are X and Y related? That is, can we predict X from observing Y? Suppose 
we define a third random variable Z that denotes the person’s weight. Is Z related to X 
or Y? 

The main advantage of dealing with random variables is that we can define certain 
probability functions that make it both convenient and easy to compute the probabilities 
of various events. These functions must naturally be consistent with the axiomatic theory. 
For this reason we must be a little careful in defining events on the real line. Elaboration 
of the ideas introduced in this section is given next. 


2.2 DEFINITION OF A RANDOM VARIABLE 


Consider an experiment $\mathcal{H}$ with sample space $\Omega$. The elements or points of $\Omega$, $\zeta$, are the 
random outcomes of $\mathcal{H}$. If to every $\zeta$ we assign a real number $X(\zeta)$, we establish a 
correspondence rule between $\zeta$ and $R^1$, the real line. Such a rule, subject to certain constraints, 
is called a random variable, abbreviated as RV. Thus, a random variable $X(\cdot)$ or simply $X$ 
is not really a variable but a function whose domain is $\Omega$ and whose range is some subset 
of the real line. Being a function, $X$ generates for every $\zeta$ a specific $X(\zeta)$ although for a 
particular $X(\zeta)$ there may be more than one outcome $\zeta$ that produced it. Now consider an 
event $E_B \subset \Omega$ ($E_B \in \mathcal{F}$). 

Through the mapping $X$, such an event maps into points on the real line (Figure 2.2-1). 
In particular, the event $\{\zeta: X(\zeta) \le x\}$, often abbreviated $\{X \le x\}$, will denote an event of 
unique importance, and we should like to assign a probability to it. As a function of the real 
variable $x$, the probability $P[X \le x] \triangleq F_X(x)$ is called the cumulative distribution function 
(CDF) of $X$. It is shown in more advanced books [2-1] and [2-2] that in order for $F_X(x)$ 


†See Appendix D for a brief discussion on measure. 

Figure 2.2-1 Symbolic representation of the action of the random variable X. 


to be consistent with the axiomatic definition of probability, the function $X$ must satisfy 
the following: For every Borel set of numbers $B$, the set $\{\zeta: X(\zeta) \in B\}$ must correspond to 
an event $E_B \in \mathcal{F}$; that is, it must be in the domain of the probability measure $P$. Stated 
somewhat more mathematically, this requirement demands that $X$ can be a random variable 
only if the inverse images under $X$ of all Borel subsets of $R^1$, making up the field $\mathcal{B}$,† are 
events. What is an inverse image? Consider an arbitrary Borel set of real numbers $B$; the set 
of points $E_B$ in $\Omega$ for which $X(\zeta)$ assumes values in $B$ is called the inverse image of the set 
$B$ under the mapping $X$. Finally, all sets of engineering interest can be written as countable 
unions or intersections of events of the form $(-\infty, x]$. The event $\{\zeta: X(\zeta) \le x\} \in \mathcal{F}$ gets 
mapped under $X$ into $(-\infty, x] \in \mathcal{B}$. Thus, if $X$ is a random variable, the set of points 
$(-\infty, x]$ is an event. 

In many if not most scientific and engineering applications, we are not interested in 
the actual form of X or the specification of the set Q. For example, we might conceive of 
an underlying experiment that consists of heating a resistor and observing the positions 
and velocities of the electrons in the resistor. The set $\Omega$ is then the totality of positions 
and velocities of all N electrons present in the resistor. Let $X$ be the thermal noise current 
produced by the resistor; clearly $X: \Omega \to R^1$ although the form of $X$, that is, the exceedingly 
complicated equations of quantum electrodynamics that map from electron positions and 
velocity configurations to current, is not specified. What we are really interested in is the 
behavior of $X$. Thus, although an underlying experiment with sample space $\Omega$ may be 
implied, it is the real line $R^1$ and its subsets that will hold our interest and figure in our 
computations. Under the mapping $X$ we have, in effect, generated a new probability space 
$(R^1, \mathcal{B}, P_X)$, where $R^1$ is the real line, $\mathcal{B}$ is the Borel $\sigma$-field of all subsets of $R^1$ generated 
by all the unions, intersections, and complements of the semi-infinite intervals $(-\infty, x]$, and 
$P_X$ is a set function assigning a number $P_X[A] \ge 0$ to each set $A \in \mathcal{B}$.† 

†The $\sigma$-field of events defined on $\Omega$ is denoted by $\mathcal{F}$. The family of Borel subsets of points on $R^1$ is 
denoted by $\mathcal{B}$. For definitions, see Section 1.4 in Chapter 1. 

In order to assign certain desirable continuity properties to the function $F_X(x)$ at 
$x = \pm\infty$, we require that the events $\{X = \infty\}$ and $\{X = -\infty\}$ have probability zero. With 
the latter our specification of a random variable is complete, and we can summarize much 
of the above discussion in the following definition. 


Definition 2.2-1 Let $\mathcal{H}$ be an experiment with sample space $\Omega$. Then the real 
random variable $X$ is a function whose domain is $\Omega$ that satisfies the following: (i) For 
every Borel set of numbers $B$, the set $E_B \triangleq \{\zeta \in \Omega: X(\zeta) \in B\}$ is an event and (ii) 
$P[X = -\infty] = P[X = +\infty] = 0$. 

Loosely speaking, when the range of X consists of a countable set of points, X is said 
to be a discrete random variable; and if the range of X is a continuum, X is said to be 
continuous. This is a somewhat inadequate definition of discrete and continuous random 
variables for the simple reason that we often like to take for the range of X the whole 
real line $R^1$. Points in $R^1$ not actually reached by the transformation $X$ with a nonzero 
probability are then associated with the impossible event.‡ 








Example 2.2-1 
(random person) A person, chosen at random off the street, is asked if he or she has a 
younger brother. If the answer is no, the data is encoded by random variable $X$ as zero; if 
the answer is yes, the data is encoded as one. The underlying experiment has sample space 
$\Omega$ = {no, yes}, sigma field $\mathcal{F} = [\phi, \Omega, \{\text{no}\}, \{\text{yes}\}]$, and probabilities $P[\phi] = 0$, $P[\Omega] = 1$, 
$P[\text{no}] = 3/4$ (an assumption), $P[\text{yes}] = 1/4$. The associated probabilities for $X$ are $P[\phi] = 0$, 
$P[X < \infty] = P[\Omega] = 1$, $P[X = 0] = P[\text{no}] = 3/4$, $P[X = 1] = P[\text{yes}] = 1/4$. Take any $x_1$, $x_2$ 
and consider, for example, the probabilities that $X$ lies in sets of the type $[x_1, x_2]$, $[x_1, x_2)$, 
or $(x_1, x_2]$. Thus, 

$$P[3 \le X \le 4] = P[\phi] = 0$$
$$P[0 \le X < 1] = P[\text{no}] = 3/4$$
$$P[0 \le X \le 2] = P[\Omega] = 1$$
$$P[0 < X \le 1] = P[\text{yes}] = 1/4,$$

and so on. Thus, every set $\{X = x\}$, $\{x_1 < X \le x_2\}$, $\{X \le x_2\}$, and so forth is related to 
an event defined on $\Omega$. Hence $X$ is a random variable. 
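
The bookkeeping in this example can be mirrored in a few lines of code. The sketch below is illustrative only: it takes P[no] = 3/4 and P[yes] = 1/4 (the example's assumption), represents the random variable as a dictionary, and computes each probability as the probability of the inverse image of the interval. The helper name `prob` is hypothetical.

```python
from fractions import Fraction

# Probabilities on the underlying sample space (3/4 is the example's assumption).
P_OUTCOME = {"no": Fraction(3, 4), "yes": Fraction(1, 4)}

# The random variable X as an explicit mapping from outcomes to real numbers.
X = {"no": 0, "yes": 1}

def prob(predicate):
    """P[X in B], computed as the probability of the inverse image of B under X."""
    inverse_image = [zeta for zeta in P_OUTCOME if predicate(X[zeta])]
    return sum((P_OUTCOME[zeta] for zeta in inverse_image), Fraction(0))

print(prob(lambda x: 3 <= x <= 4))   # inverse image is the empty set
print(prob(lambda x: 0 <= x < 1))    # inverse image is {no}
print(prob(lambda x: 0 <= x <= 2))   # inverse image is the whole sample space
print(prob(lambda x: 0 < x <= 1))    # inverse image is {yes}
```

Because the sample space is finite and every subset of it is an event here, every inverse image is automatically an event, which is the reason X qualifies as a random variable.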


†The extraordinary advantage of dealing with random variables is that a single pointwise function, that 
is, the cumulative distribution function $F_X(x)$, can replace the set function $P_X[\cdot]$ that may be extremely 
cumbersome to specify, since it must be specified for every event (set) $A \in \mathcal{B}$. See Section 2.3. 

‡An alternative definition is the following: $X$ is discrete if $F_X(x)$ is a staircase-type function, and $X$ 
is continuous if $F_X(x)$ is a continuous function. Some random variables cannot be classified as discrete or 
continuous; they are discussed in Section 2.5. 
Example 2.2-2 
(random bus arrival time) A bus arrives at random in $[0, T]$; let $t$ denote the time of arrival. 
The sample space $\Omega$ is $\Omega = \{t: t \in [0, T]\}$. A random variable $X$ is defined by 

$$X(t) = \begin{cases} 1, & t \in [T/4, T/2], \\ 0, & \text{otherwise.} \end{cases}$$

Assume that the arrival time is uniform over $[0, T]$. We can now ask and compute what is 
$P[X(t) = 1]$ or $P[X(t) = 0]$ or $P[X(t) \le 5]$. 





Example 2.2-3 
(drawing from urn) An urn contains three colored balls. The balls are colored white (W), 
black (B), and red (R), respectively. The experiment consists of choosing a ball at random 
from the urn. The sample space is Q = {W,B,R}. The random variable X is 
defined by 


$$X(\zeta) = \begin{cases} \pi, & \zeta = \text{W or B}, \\ 0, & \zeta = \text{R}. \end{cases}$$

We can ask and compute the probability $P[X \le x_1]$, where $x_1$ is any number. Thus, 
$\{X \le 0\} = \{R\}$, $\{2 \le X < 4\} = \{W, B\}$. The computation of the associated probabilities 
is left as an exercise. 


Example 2.2-4 
(wheel of chance) A spinning wheel and pointer has 50 sectors numbered n = 0,1,..., 49. 
The experiment consists of spinning the wheel. Because the players are interested only in 
even or odd outcomes, they choose $\Omega$ = {even, odd} and the only events in the $\sigma$-field 
are $\{\phi, \Omega, \text{even}, \text{odd}\}$. Let $X = n$, that is, if $n$ shows up, $X$ assumes that value. Is $X$ a 
random variable? Note that the inverse image of the set {2,3} is not an event. Hence 
X is not a valid random variable on this probability space because it is not a function 
on 2. 











2.3 CUMULATIVE DISTRIBUTION FUNCTION 


In Example 2.2-1 the induced event space under $X$ includes $\{0, 1\}$, $\{0\}$, $\{1\}$, $\phi$, for which 
the probabilities are $P[X = 0 \text{ or } 1] = 1$, $P[X = 0] = 3/4$, $P[X = 1] = 1/4$, and $P[\phi] = 0$. From 
these probabilities, we can infer any other probabilities such as $P[X \le 0.5]$. In many cases 
it is awkward to write down $P[\cdot]$ for every event. For this reason we introduce a pointwise 
probability function called the cumulative distribution function (CDF). The CDF is a function 
of $x$, which contains all the information necessary to compute $P[E]$ for any $E$ in the Borel 
field of events. The CDF, $F_X(x)$, is defined by 

$$F_X(x) = P[\{\zeta: X(\zeta) \le x\}] = P_X[(-\infty, x]]. \tag{2.3-1}$$
Equation 2.3-1 is read as "the set of all outcomes $\zeta$ in the underlying sample space such 
that the function $X(\zeta)$ assumes values less than or equal to $x$." Thus, there is a subset of 
outcomes $\{\zeta: X(\zeta) \le x\} \subset \Omega$ that the mapping $X(\cdot)$ generates as the set $[-\infty, x] \subset R^1$. 
The sets $\{\zeta: X(\zeta) \le x\} \subset \Omega$ and $[-\infty, x] \subset R^1$ are equivalent events. We shall frequently 
leave out the dependence on the underlying sample space and write merely $P[X \le x]$ or 
$P[a < X \le b]$. 

For the present we shall denote random variables by capital letters, that is, $X$, $Y$, $Z$, 
and the values they can take by lowercase letters $x$, $y$, $z$. The subscript $X$ on $F_X(x)$ 
associates it with the random variable for which it is the CDF. Thus, $F_X(y)$ means the CDF 
of random variable $X$ evaluated at the real number $y$ and thus equals the probability 
$P[X \le y]$. If $F_X(x)$ is discontinuous at a point, say, $x_0$, then $F_X(x_0)$ will be taken to 
mean the value of the CDF immediately to the right of $x_0$ (we call this continuity from the 
right). 


Properties† of $F_X(x)$ 


(i) $F_X(\infty) = 1$, $F_X(-\infty) = 0$. 
(ii) $x_1 \le x_2 \Rightarrow F_X(x_1) \le F_X(x_2)$, that is, $F_X(x)$ is a nondecreasing function of $x$. 
(iii) $F_X(x)$ is continuous from the right, that is, 

$$F_X(x) = \lim_{\varepsilon \to 0} F_X(x + \varepsilon), \quad \varepsilon > 0.$$

Proof of (ii) Consider the event $\{x_1 < X \le x_2\}$ with $x_2 > x_1$. The set $(x_1, x_2]$ is 
nonempty and $\in \mathcal{B}$. Hence 

$$0 \le P[x_1 < X \le x_2] \le 1.$$

But 

$$\{X \le x_2\} = \{X \le x_1\} \cup \{x_1 < X \le x_2\}$$

and 

$$\{X \le x_1\} \cap \{x_1 < X \le x_2\} = \phi.$$

Hence 

$$F_X(x_2) = F_X(x_1) + P[x_1 < X \le x_2]$$

or 

$$P[x_1 < X \le x_2] = F_X(x_2) - F_X(x_1) \ge 0 \quad \text{for } x_2 > x_1. \tag{2.3-2}$$

We leave it to the reader to establish the following results: 

$$P[a \le X \le b] = F_X(b) - F_X(a) + P[X = a];$$
$$P[a < X < b] = F_X(b) - P[X = b] - F_X(a);$$
$$P[a \le X < b] = F_X(b) - P[X = b] - F_X(a) + P[X = a].$$


†Properties (i) and (iii) require proof. This is furnished with the help of extended axioms in Chapter 8. 
Also see Davenport [2-3, Chapter 4]. 
Example 2.3-1 
(parity bits) The experiment consists of observing the voltage $X$ of the parity bit in a word 
in computer memory. If the bit is on, then $X = 1$; if off then $X = 0$. Assume that the off 
state has probability $q$ and the on state has probability $1 - q$. The sample space has only 
two points: $\Omega$ = {off, on}. 

Computation of $F_X(x)$ 
(i) $x < 0$: The event $\{X \le x\} = \phi$ and $F_X(x) = 0$. 
(ii) $0 \le x < 1$: The event $\{X \le x\}$ is equivalent to the event {off} and excludes the 
event {on}. Hence $F_X(x) = q$. 
(iii) $x \ge 1$: The event $\{X \le x\} = \Omega$ is the certain event since 

$$X(\text{on}) = 1 \le x, \qquad X(\text{off}) = 0 \le x.$$

Hence $F_X(x) = 1$. 

The solution is shown in Figure 2.3-1. 


Example 2.3-2 
(waiting for a bus) A bus arrives at random in $(0, T]$. Let the random variable $X$ denote 
the time of arrival. Then clearly $F_X(t) = 0$ for $t \le 0$ and $F_X(T) = 1$ because the former 
is the probability of the impossible event while the latter is the probability of the certain 
event. Suppose it is known that the bus is equally likely or uniformly likely to come at any 
time within $(0, T]$. Then 

$$F_X(t) = \begin{cases} 0, & t \le 0, \\ t/T, & 0 < t \le T, \\ 1, & t > T. \end{cases} \tag{2.3-3}$$

Figure 2.3-1 Cumulative distribution function associated with the parity bit observation experiment. 

Figure 2.3-2 Cumulative distribution function of the uniform random variable X of Example 2.3-2. 


Actually Equation 2.3-3 defines “equally likely,” not the other way around. The CDF is 
shown in Figure 2.3-2. In this case we say that X is uniformly distributed. 





If $F_X(x)$ is a continuous function of $x$, then 

$$F_X(x) = F_X(x^-). \tag{2.3-4}$$

However, if $F_X(x)$ is discontinuous at the point $x$, then, from Equation 2.3-2, 

$$F_X(x) - F_X(x^-) = P[x^- < X \le x] = \lim_{\varepsilon \to 0} P[x - \varepsilon < X \le x] \triangleq P[X = x]. \tag{2.3-5}$$

Typically $P[X = x]$ is a discontinuous function of $x$; it is zero whenever $F_X(x)$ is continuous 
and nonzero only at discontinuities in $F_X(x)$. 


Example 2.3-3 
(binomial distribution function) Compute the CDF for a binomial random variable $X$ with 
parameters $(n, p)$. 


Solution Since $X$ takes on only discrete values, that is, $X \in \{0, 1, 2, \ldots, n\}$, the event 
$\{X \le x\}$ is the same as $\{X \le [x]\}$, where $[x]$ is the largest integer equal to or smaller 
than $x$. Then $F_X(x)$ is given by the stepwise constant function 

$$F_X(x) = \sum_{j=0}^{[x]} \binom{n}{j} p^j (1 - p)^{n-j}.$$

For $p = 0.6$, $n = 4$, the CDF has the appearance of a staircase function as shown in 
Figure 2.3-3. 

Figure 2.3-3 Cumulative distribution function for a binomial RV with n = 4, p = 0.6. 


Example 2.3-4 
(computing binomial probabilities) Using the results of Example 2.3-3, compute the following: 


(a) $P[1.5 < X < 3]$; 
(b) $P[0 \le X \le 3]$; 
(c) $P[1.2 < X \le 1.8]$; 
(d) $P[1.99 \le X < 3]$. 

Solution 
(a) $P[1.5 < X < 3] = F_X(3) - P[X = 3] - F_X(1.5)$ 
$= 0.8704 - 0.3456 - 0.1792 = 0.3456$; 
(b) $P[0 \le X \le 3] = F_X(3) - F_X(0) + P[X = 0]$ 
$= 0.8704 - 0.0256 + 0.0256 = 0.8704$; 
(c) $P[1.2 < X \le 1.8] = F_X(1.8) - F_X(1.2)$ 
$= 0.1792 - 0.1792 = 0$; 
(d) $P[1.99 \le X < 3] = F_X(3) - P[X = 3] - F_X(1.99) + P[X = 1.99]$ 
$= 0.8704 - 0.3456 - 0.1792 + 0 = 0.3456$. 
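
The four answers can be checked numerically. The following sketch, with hypothetical helper names `binom_pmf` and `binom_cdf`, evaluates the staircase CDF of Example 2.3-3 for n = 4, p = 0.6 and then applies the interval formulas stated after Equation 2.3-2.

```python
from math import comb, floor

def binom_pmf(k, n, p):
    """P[X = k] for a binomial RV; zero unless k is an integer in 0..n."""
    if k != int(k) or not 0 <= k <= n:
        return 0.0
    k = int(k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(x, n, p):
    """F_X(x) = P[X <= x], a right-continuous staircase in the real variable x."""
    return sum(binom_pmf(j, n, p) for j in range(0, min(floor(x), n) + 1))

n, p = 4, 0.6
F = lambda x: binom_cdf(x, n, p)
P = lambda x: binom_pmf(x, n, p)

a = F(3) - P(3) - F(1.5)             # P[1.5 <  X <  3]
b = F(3) - F(0) + P(0)               # P[0   <= X <= 3]
c = F(1.8) - F(1.2)                  # P[1.2 <  X <= 1.8]
d = F(3) - P(3) - F(1.99) + P(1.99)  # P[1.99 <= X <  3]
print(round(a, 4), round(b, 4), round(c, 4), round(d, 4))
```

Up to floating-point rounding, the printed values should agree with 0.3456, 0.8704, 0, and 0.3456 from the solution above.
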
Note that even for a discrete RV, we have taken the CDF to be a function of a continuous 
variable, x in this example. However, for a discrete RV, it is sometimes simpler (but more 
restrictive) to consider the CDF to be discrete also. Let X be a discrete RV taking on values 
$\{x_k\}$ with probability mass function (PMF) $P_X(x_k)$. Then the discrete CDF would only be 
defined on the values $\{x_k\}$ also. Assuming that these values are an increasing set, that is, 
$x_k < x_{k+1}$ for all $k$, the discrete CDF would be 

$$F_X(x_k) \triangleq \sum_{j=-\infty}^{k} P_X(x_j) \quad \text{for all } k.$$


In this format, we compute the CDF only at points corresponding to the countable outcomes 
of the sample space. 
Looking again at the binomial example $b(n, p)$ above, but using the discrete CDF, we 
would say the RV $K$ takes on values in the set $\{0 \le k \le 4\}$ with the discrete CDF 

$$F_K(k) = \sum_{j=0}^{k} \binom{4}{j} (0.6)^j (1 - 0.6)^{4-j} \quad \text{for } 0 \le k \le 4.$$

While this is more natural for a discrete RV, the reader will note that the discrete CDF 
cannot be used to evaluate probabilities such as $P[1.5 < K < 3]$ since it cannot be evaluated 
at 1.5. For this reason, we generally will consider CDFs as defined for a continuous domain, 
even though the RV in question might be discrete valued. 





2.4 PROBABILITY DENSITY FUNCTION (pdf) 


If $F_X(x)$ is continuous and differentiable, the pdf is computed from 

$$f_X(x) = \frac{dF_X(x)}{dx}. \tag{2.4-1}$$

Properties. If $f_X(x)$ exists, then 

(i) $$f_X(x) \ge 0. \tag{2.4-2}$$
(ii) $$\int_{-\infty}^{\infty} f_X(\xi)\,d\xi = F_X(\infty) - F_X(-\infty) = 1. \tag{2.4-3}$$
(iii) $$F_X(x) = \int_{-\infty}^{x} f_X(\xi)\,d\xi = P[X \le x]. \tag{2.4-4}$$
(iv) $$F_X(x_2) - F_X(x_1) = \int_{-\infty}^{x_2} f_X(\xi)\,d\xi - \int_{-\infty}^{x_1} f_X(\xi)\,d\xi = \int_{x_1}^{x_2} f_X(\xi)\,d\xi = P[x_1 < X \le x_2]. \tag{2.4-5}$$

Interpretation of $f_X(x)$. 

$$P[x < X \le x + \Delta x] = F_X(x + \Delta x) - F_X(x).$$

If $F_X(x)$ is continuous in its first derivative then, for sufficiently small $\Delta x$, 

$$F_X(x + \Delta x) - F_X(x) = \int_{x}^{x + \Delta x} f_X(\xi)\,d\xi \simeq f_X(x)\,\Delta x.$$

Hence for small $\Delta x$ 

$$P[x < X \le x + \Delta x] \simeq f_X(x)\,\Delta x. \tag{2.4-6}$$

Observe that if $f_X(x)$ exists, meaning that it is bounded and has at most a finite number of 
discontinuities, then $F_X(x)$ is continuous and therefore, from Equation 2.3-5, $P[X = x] = 0$. 
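The univariate Normal (Gaussian†) pdf. The pdf is given by 

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < +\infty. \tag{2.4-7}$$

There are two distinct parameters: the mean $\mu$ and the standard deviation $\sigma\,(> 0)$. (Note 
that $\sigma^2$ is called the variance.) We show that this density is valid by integrating over all $x$ 
as follows: 

$$\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-\frac{1}{2}y^2}\,dy, \quad \text{with the substitution } y \triangleq \frac{x - \mu}{\sigma},$$

$$= \frac{2}{\sqrt{2\pi}} \int_{0}^{+\infty} e^{-\frac{1}{2}y^2}\,dy = \frac{2}{\sqrt{2\pi}} \sqrt{\frac{\pi}{2}} = 1,$$

where we make use of the known integral 

$$\int_{0}^{\infty} e^{-\frac{1}{2}x^2}\,dx = \sqrt{\frac{\pi}{2}}.$$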
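
As a numerical cross-check of this normalization, the sketch below integrates the Normal pdf with a simple composite midpoint rule. The integration limits of ten standard deviations, the step count, and the sample values of the mean and standard deviation are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """The univariate Normal pdf with mean mu and standard deviation sigma > 0."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def integrate_midpoint(f, lo, hi, steps=200_000):
    """Composite midpoint rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

mu, sigma = 1000.0, 200.0   # illustrative values only

# Total area under the pdf, truncated at mu +/- 10 sigma.
area = integrate_midpoint(lambda x: normal_pdf(x, mu, sigma),
                          mu - 10 * sigma, mu + 10 * sigma)

# The "known integral" over the half line, truncated at y = 10.
half_line = integrate_midpoint(lambda y: math.exp(-0.5 * y * y), 0.0, 10.0)

print(area, half_line, math.sqrt(math.pi / 2))
```

The first number should be very close to 1 and the second very close to the third; the truncated tails contribute a negligible amount.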


Now the Gaussian (Normal) random variable is very common in applications and a special 
notation is used to specify it. We often say that $X$ is distributed as $N(\mu, \sigma^2)$ or write 
$X: N(\mu, \sigma^2)$ to specify this distribution.‡ 

For any random variable with a well-defined pdf, we can in general compute the mean 
and variance (the square of the standard deviation), if it exists, from the two formulas 











$$\mu \triangleq \int_{-\infty}^{\infty} x f_X(x)\,dx \tag{2.4-8}$$

and 

$$\sigma^2 \triangleq \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx. \tag{2.4-9}$$

We will defer to Chapter 4 the proof that the parameters we call $\mu$ and $\sigma^2$ in the Gaussian 
distribution are actually the true mean and variance as defined generally in these two 
equations. 

For discrete random variables, we compute the mean and variance from the sums 


$$\mu \triangleq \sum_{i=-\infty}^{\infty} x_i P_X(x_i) \tag{2.4-10}$$

and 

$$\sigma^2 \triangleq \sum_{i=-\infty}^{\infty} (x_i - \mu)^2 P_X(x_i). \tag{2.4-11}$$

Here are some simple examples of the computation of mean and variance. 

†After the German mathematician/physicist Carl F. Gauss (1777-1855). 
‡The reader may note the capital letter on the word Normal. We use this choice to make the reader 
aware that while Gaussian or Normal is very common, it is not normal or ubiquitous in the everyday sense. 


Example 2.4-1 
Let $f_X(x) = 1$ for $0 < x \le 1$ and zero elsewhere. This pdf is a special case of the uniform 
law discussed below. The mean is computed as 

$$\mu = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{0}^{1} x\,dx = 0.5$$

and the variance is computed as 

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx = \int_{0}^{1} (x - 0.5)^2\,dx = 1/12.$$


Example 2.4-2 
Suppose we are given that $P_X(0) = P_X(2) = 0.25$, $P_X(1) = 0.5$, and zero elsewhere. For 
this discrete RV, we use Equations 2.4-10 and 2.4-11 to obtain 

$$\mu = 0 \times 0.25 + 1 \times 0.5 + 2 \times 0.25 = 1$$

and 

$$\sigma^2 = (0 - 1)^2 \times 0.25 + (1 - 1)^2 \times 0.5 + (2 - 1)^2 \times 0.25 = 0.5.$$

The mean and variance are common examples of statistical moments, whose discussion 
is postponed till Chapter 4. The Normal pdf is shown in Figure 2.4-1. 

The Normal pdf is widely encountered in all branches of science and engineering as 
well as in social and demographic studies. For example, the IQ of children, the heights of 
men (or women), and the noise voltage produced by a thermally agitated resistor are all 
postulated to be approximately Normal over a large range of values. 





Figure 2.4-1 The Normal pdf. 
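Conversion of the Gaussian pdf to the standard Normal. Suppose we are given $X: N(\mu, \sigma^2)$ 
and must evaluate $P[a < X \le b]$. We have 

$$P[a < X \le b] = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{a}^{b} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx.$$

With $\beta \triangleq (x - \mu)/\sigma$, $d\beta = (1/\sigma)\,dx$, $b' \triangleq (b - \mu)/\sigma$, $a' \triangleq (a - \mu)/\sigma$, we obtain 

$$P[a < X \le b] = \frac{1}{\sqrt{2\pi}} \int_{a'}^{b'} e^{-\frac{1}{2}\beta^2}\,d\beta = \frac{1}{\sqrt{2\pi}} \int_{0}^{b'} e^{-\frac{1}{2}\beta^2}\,d\beta - \frac{1}{\sqrt{2\pi}} \int_{0}^{a'} e^{-\frac{1}{2}\beta^2}\,d\beta.$$

The function 

$$\operatorname{erf}(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_{0}^{x} e^{-\frac{1}{2}t^2}\,dt \tag{2.4-12}$$

is sometimes called the error function [erf(x)] although other definitions of erf(x) exist.† 
The erf(x) is tabulated in Table 2.4-1 and is plotted in Figure 2.4-2. 

Hence if $X: N(\mu, \sigma^2)$, then 

$$P[a < X \le b] = \operatorname{erf}\left(\frac{b - \mu}{\sigma}\right) - \operatorname{erf}\left(\frac{a - \mu}{\sigma}\right). \tag{2.4-13}$$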

Example 2.4-3
(resistor tolerance) Suppose we choose a resistor with resistance $R$ from a batch of resistors
with parameters $\mu = 1000$ ohms and $\sigma = 200$ ohms. What is the probability that $R$ will
have a value between 900 and 1100 ohms?


Solution Assuming that $R: N[1000, (200)^2]$ we compute from Equation 2.4-13

$$P[900 < R \le 1100] = \operatorname{erf}(0.5) - \operatorname{erf}(-0.5).$$

But $\operatorname{erf}(-x) = -\operatorname{erf}(x)$ (deduced from Equation 2.4-12). Hence

$$P[900 < R \le 1100] = 2\operatorname{erf}(0.5) = 0.38.$$
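Equation 2.4-13 can be evaluated without the table. The sketch below is an illustration only: it assumes Python's `math.erf`, which implements the $(2/\sqrt{\pi})\int_0^x e^{-t^2}dt$ convention, and converts it to the erf of Equation 2.4-12. The names `book_erf` and `normal_interval_prob` are hypothetical.

```python
import math


def book_erf(x):
    """erf as in Eq. 2.4-12: (1/sqrt(2*pi)) * integral from 0 to x of exp(-t^2/2) dt."""
    # math.erf uses the (2/sqrt(pi)) * integral of exp(-t^2) convention.
    return 0.5 * math.erf(x / math.sqrt(2.0))


def normal_interval_prob(a, b, mu, sigma):
    """P[a < X <= b] for X: N(mu, sigma^2), via Eq. 2.4-13."""
    return book_erf((b - mu) / sigma) - book_erf((a - mu) / sigma)


if __name__ == "__main__":
    print(book_erf(1.0))                                # compare with Table 2.4-1
    print(normal_interval_prob(900, 1100, 1000, 200))   # resistor example
```

Because the conversion is a one-line rescaling, any library erf that follows the $2/\sqrt{\pi}$ convention can stand in for the table.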








†For example, a widely used definition of erf(x) is $\operatorname{erf}_2(x) \triangleq (2/\sqrt{\pi}) \int_0^x e^{-t^2} dt$, which is used in
MATLAB. The relation between these two erf's is $\operatorname{erf}(x) = \frac{1}{2}\operatorname{erf}_2(x/\sqrt{2})$.
Table 2.4-1 Selected Values of erf(x)

$$\operatorname{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_0^x \exp\left(-\tfrac{1}{2}t^2\right) dt$$

x       erf(x)      x       erf(x)
0.05    0.01994     2.05    0.47981
0.10    0.03983     2.10    0.48213
0.15    0.05962     2.15    0.48421
0.20    0.07926     2.20    0.48609
0.25    0.09871     2.25    0.48777
0.30    0.11791     2.30    0.48927
0.35    0.13683     2.35    0.49060
0.40    0.15542     2.40    0.49179
0.45    0.17364     2.45    0.49285
0.50    0.19146     2.50    0.49378
0.55    0.20884     2.55    0.49460
0.60    0.22575     2.60    0.49533
0.65    0.24215     2.65    0.49596
0.70    0.25803     2.70    0.49652
0.75    0.27337     2.75    0.49701
0.80    0.28814     2.80    0.49743
0.85    0.30233     2.85    0.49780
0.90    0.31594     2.90    0.49812
0.95    0.32894     2.95    0.49840
1.00    0.34134     3.00    0.49864
1.05    0.35314     3.05    0.49884
1.10    0.36433     3.10    0.49902
1.15    0.37492     3.15    0.49917
1.20    0.38492     3.20    0.49930
1.25    0.39434     3.25    0.49941
1.30    0.40319     3.30    0.49951
1.35    0.41149     3.35    0.49958
1.40    0.41924     3.40    0.49965
1.45    0.42646     3.45    0.49971
1.50    0.43319     3.50    0.49976
1.55    0.43942     3.55    0.49980
1.60    0.44519     3.60    0.49983
1.65    0.45052     3.65    0.49986
1.70    0.45543     3.70    0.49988
1.75    0.45993     3.75    0.49990
1.80    0.46406     3.80    0.49992
1.85    0.46783     3.85    0.49993
1.90    0.47127     3.90    0.49994
1.95    0.47440     3.95    0.49995
2.00    0.47724     4.00    0.49996
Figure 2.4-2 erf(x) versus x.


Using Figure 2.4-3 as an aid in our reasoning, we readily deduce the following for
$X: N(0,1)$. Assume $x > 0$; then

$$P[X \le x] = \tfrac{1}{2} + \operatorname{erf}(x), \tag{2.4-14a}$$
$$P[X > -x] = \tfrac{1}{2} + \operatorname{erf}(x), \tag{2.4-14b}$$
$$P[X > x] = \tfrac{1}{2} - \operatorname{erf}(x), \tag{2.4-14c}$$
$$P[-x < X \le x] = 2\operatorname{erf}(x), \tag{2.4-14d}$$
$$P[|X| > x] = 1 - 2\operatorname{erf}(x). \tag{2.4-14e}$$


Example 2.4-4
(manufacturing) A metal rod is nominally 1 meter long, but due to manufacturing imperfections,
the actual length $L$ is a Gaussian random variable with mean $\mu = 1$ and standard
deviation $\sigma = 0.005$. What is the probability that the rod length $L$ lies in the interval
[0.99, 1.01]? Since the random variable $L: N(1, (0.005)^2)$, we have
Figure 2.4-3 The areas of the shaded region under curves are (a) $P[X \le x]$; (b) $P[X > -x]$; (c)
$P[X > x]$; (d) $P[-x < X \le x]$; and (e) $P[|X| > x]$.








$$P[0.99 < L \le 1.01] = \int_{0.99}^{1.01} \frac{1}{\sqrt{2\pi}\,(0.005)} e^{-\frac{1}{2}\left(\frac{x - 1.00}{0.005}\right)^2} dx$$
$$= \int_{\frac{0.99 - 1.00}{0.005}}^{\frac{1.01 - 1.00}{0.005}} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} dx$$
$$= 2\operatorname{erf}(2) = 2 \times 0.4772 \quad \text{(from Table 2.4-1)}$$
$$= 0.954.$$





Four Other Common Density Functions 
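1. Rayleigh ($\sigma > 0$):

$$f_X(x) = \frac{x}{\sigma^2} e^{-x^2/2\sigma^2} u(x), \tag{2.4-15}$$

where the continuous unit-step function is defined as

$$u(x) \triangleq \begin{cases} 1, & 0 \le x < \infty, \\ 0, & -\infty < x < 0. \end{cases}$$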


Thus, fx(x) = 0 for x < 0. Examples of where the Rayleigh pdf shows up are in rocket- 
landing errors, random fluctuations in the envelope of certain waveforms, and radial distri- 
bution of misses around the bull’s-eye at a rifle range. 

2. Exponential ($\mu > 0$):

$$f_X(x) = \frac{1}{\mu} e^{-x/\mu} u(x). \tag{2.4-16}$$


The exponential law occurs, for example, in waiting-time problems, in calculating lifetime 
of machinery, and in describing the intensity variations of incoherent light. 
3. Uniform ($b > a$):

$$f_X(x) = \begin{cases} \dfrac{1}{b - a}, & a < x < b, \\ 0, & \text{otherwise.} \end{cases} \tag{2.4-17}$$


The uniform pdf is used in communication theory, in queueing models, and in situations
where we have no a priori knowledge favoring the distribution of outcomes except for the
end points; for example, we don't know when a business call will come but it must come, say,
between 9 A.M. and 5 P.M. We sometimes use the notation $U(a,b)$ to denote a uniform
distribution lower-bounded by $a$ and upper-bounded by $b$.

The three pdf’s are shown in Figure 2.4-4. 

4. Laplacian: The pdf is defined by

$$f_X(x) = \frac{c}{2} e^{-c|x|}, \quad -\infty < x < \infty, \quad c > 0. \tag{2.4-18}$$

The Laplacian is widely used in speech and image processing to model the adjacent-sample
difference, that is, the difference in signal level between a sample point and its neighbor. Since
Figure 2.4-4 The Rayleigh, exponential, and uniform pdf's.

Figure 2.4-5 The Laplacian pdf used in computer analysis of speech and images.


the levels of the sample point and its neighbor are often the same, the Laplacian peaks at
zero. The Laplacian pdf is sometimes written as

$$f_X(x) = \frac{1}{\sqrt{2}\,\sigma} \exp\left[-\sqrt{2}\,|x|/\sigma\right], \quad -\infty < x < \infty, \quad \sigma > 0, \tag{2.4-19}$$

where $\sigma$ is the standard deviation of the Laplacian RV $X$. Precisely what this means will be
explained in Chapter 4. The Laplacian pdf is shown in Figure 2.4-5. In image compression,
the Laplacian model is appropriate for the so-called “AC coefficients” that arise after a
decorrelating transform called the DCT,† which is applied on 8 × 8 blocks of pixels.


Example 2.4-5 
(radiated power) The received power $W$ on a cell phone at a certain distance from the base
station is found to follow a Rayleigh distribution with parameter $\sigma = 1$ milliwatt. What





†DCT stands for discrete cosine transform and is a variation on the DFT used in signal analysis. A
2-D version is used for images, consisting of a 1-D DCT on the rows followed by a 1-D transform on the
columns.
is the probability that the power $W$ is less than 0.8 milliwatts? Since the power can be
modeled by the Rayleigh random variable, we have

$$P[W \le 0.8] = \int_{0.0}^{0.8} x e^{-x^2/2}\,dx, \quad \text{since } \sigma^2 = 1,$$
$$= \int_{0.0}^{0.8} e^{-x^2/2}\, d\left(\tfrac{1}{2}x^2\right)$$
$$= \int_{0.0}^{0.32} e^{-y}\,dy, \quad \text{with the substitution } y \triangleq \tfrac{1}{2}x^2,$$
$$= 1 - e^{-0.32} \approx 0.27.$$


Example 2.4-6 
(image compression) In designing the quantizer for a JPEG image compression system, 
we need to know what the range should be for the transformed AC coefficients. Using the 
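Laplacian model with parameter $\sigma$ for such a coefficient $X$, what is the probability of the
event $\{|X| \ge k\sigma\}$ as a function of $k = 1, 2, 3, \ldots$? If we then make this probability sufficiently
low, by choice of $k$, we will design the quantizer for the range $[-k\sigma, +k\sigma]$ and only saturate
the quantizer occasionally. We need to calculate

$$P[|X| \ge k\sigma] = \int_{k\sigma}^{\infty} \frac{1}{\sqrt{2}\,\sigma} \exp\left(-\sqrt{2}\,x/\sigma\right) dx + \int_{-\infty}^{-k\sigma} \frac{1}{\sqrt{2}\,\sigma} \exp\left(+\sqrt{2}\,x/\sigma\right) dx$$
$$= 2 \int_{k\sigma}^{\infty} \frac{1}{\sqrt{2}\,\sigma} \exp\left(-\sqrt{2}\,x/\sigma\right) dx$$
$$= 2 \int_{k}^{\infty} \frac{1}{\sqrt{2}} \exp\left(-\sqrt{2}\,y\right) dy, \quad \text{with } y \triangleq x/\sigma,$$
$$= -2 \cdot \frac{1}{2} \exp\left(-\sqrt{2}\,y\right) \Big|_{k}^{\infty}$$
$$= \exp\left(-\sqrt{2}\,k\right).$$

For $k = 2$, we get probability 0.059 and for $k = 5$ we get $0.85 \times 10^{-3}$, or about one in a
thousand coefficients.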
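The closed forms in Examples 2.4-5 and 2.4-6 are easy to evaluate directly. The following sketch is illustrative only; the function names `rayleigh_cdf` and `laplacian_tail` are hypothetical, and the formulas are the Rayleigh CDF $1 - e^{-x^2/2\sigma^2}$ and the Laplacian tail $e^{-\sqrt{2}k}$ derived above.

```python
import math


def rayleigh_cdf(x, sigma):
    """P[X <= x] for the Rayleigh pdf of Eq. 2.4-15."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-x * x / (2.0 * sigma ** 2))


def laplacian_tail(k):
    """P[|X| >= k*sigma] for the Laplacian pdf of Eq. 2.4-19 (independent of sigma)."""
    return math.exp(-math.sqrt(2.0) * k)


if __name__ == "__main__":
    print(rayleigh_cdf(0.8, 1.0))          # Example 2.4-5: 1 - exp(-0.32)
    for k in range(1, 6):
        print(k, laplacian_tail(k))        # Example 2.4-6: saturation probability vs k
```

Printing the tail for several values of $k$ shows how quickly the saturation probability falls as the quantizer range widens.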


Table 2.4-2 lists some common continuous random variables with their probability densi- 
ties and distribution functions. 


More Advanced Density Functions 


5. Chi-square ($n$ an integer)

$$f_X(x) = K_\chi\, x^{(n/2) - 1} e^{-x/2} u(x), \tag{2.4-20}$$
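Table 2.4-2 Common Continuous Probability Densities and Distribution Functions

Family                       pdf $f_X(x)$                                                                    CDF $F_X(x)$

Uniform $U(a,b)$             $\frac{1}{b-a}[u(x - a) - u(x - b)]$                                            $0$ for $x < a$; $\frac{x-a}{b-a}$ for $a \le x < b$; $1$ for $b \le x$
Exponential $\mu > 0$        $\frac{1}{\mu} e^{-x/\mu} u(x)$                                                 $(1 - e^{-x/\mu})\, u(x)$
Gaussian $N(\mu, \sigma^2)$  $\frac{1}{\sqrt{2\pi}\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$   $\frac{1}{2} + \operatorname{erf}\left(\frac{x-\mu}{\sigma}\right)$
Laplacian $\sigma > 0$       $\frac{1}{\sqrt{2}\sigma} \exp\left[-\sqrt{2}|x|/\sigma\right]$                 $\frac{1}{2}\left[1 + \operatorname{sgn}(x)\left(1 - \exp(-\sqrt{2}|x|/\sigma)\right)\right]$
Rayleigh $\sigma > 0$        $\frac{x}{\sigma^2} e^{-x^2/2\sigma^2} u(x)$                                    $\left[1 - e^{-x^2/2\sigma^2}\right] u(x)$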


Figure 2.4-6 The Chi-square probability density function for n = 2 (solid), n = 4 (dashed), and n = 10
(stars), plotted as pdf value versus argument value from 0 to 50. Note that for larger values of n, the
shape approaches that of a Normal pdf with a positive mean-parameter $\mu$.


where the normalizing constant $K_\chi$ is computed as $K_\chi = \dfrac{1}{2^{n/2}\,\Gamma(n/2)}$ and $\Gamma(\cdot)$ is the Gamma
function discussed in Appendix B. The Chi-square pdf is shown in Figure 2.4-6.
6. Gamma: ($b > 0, c > 0$)

$$f_X(x) = K_\gamma\, x^{b-1} e^{-cx} u(x), \tag{2.4-21}$$

where $K_\gamma = c^b / \Gamma(b)$.
7. Student-t: ($n$ an integer)

$$f_X(x) = K_{st} \left(1 + \frac{x^2}{n}\right)^{-\left(\frac{n+1}{2}\right)}, \quad -\infty < x < \infty \tag{2.4-22}$$
Figure 2.4-7 The beta pdf shown for $\beta = 1$, $\alpha = n - 2$, and various values of $n$. When $\beta = 0, \alpha = 0$
the beta pdf becomes uniformly distributed over $0 < x < 1$.


where

$$K_{st} = \frac{\Gamma[(n + 1)/2]}{\Gamma(n/2)\sqrt{\pi n}}.$$


The Chi-square and Student-t densities are widely used in statistics.† We shall encounter
these densities later in the book. The gamma density is mother to other densities. For
example, with $b = 1$, there results the exponential density; and with $b = n/2$ and $c = 1/2$,
there results the Chi-square density.
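The two special cases of the gamma density can be confirmed numerically. This is a minimal sketch under stated assumptions: it takes the gamma density as $K_\gamma x^{b-1} e^{-cx}$ with $K_\gamma = c^b/\Gamma(b)$, the Chi-square normalizing constant as $1/(2^{n/2}\Gamma(n/2))$, and the exponential density as $(1/\mu)e^{-x/\mu}$ (so that $c = 1/\mu$). The function names are hypothetical, and all densities are evaluated only for $x > 0$.

```python
import math


def gamma_pdf(x, b, c):
    """Gamma density K * x^(b-1) * exp(-c*x) for x > 0, with K = c^b / Gamma(b)."""
    if x <= 0:
        return 0.0
    return c ** b / math.gamma(b) * x ** (b - 1) * math.exp(-c * x)


def exponential_pdf(x, mu):
    """Exponential density (1/mu) * exp(-x/mu) for x > 0."""
    return math.exp(-x / mu) / mu if x > 0 else 0.0


def chi_square_pdf(x, n):
    """Chi-square density with n degrees of freedom for x > 0."""
    if x <= 0:
        return 0.0
    k = 1.0 / (2 ** (n / 2) * math.gamma(n / 2))
    return k * x ** (n / 2 - 1) * math.exp(-x / 2)


if __name__ == "__main__":
    x = 1.5
    print(gamma_pdf(x, 1, 0.5), exponential_pdf(x, 2.0))   # b = 1, c = 1/mu
    print(gamma_pdf(x, 4 / 2, 0.5), chi_square_pdf(x, 4))  # b = n/2, c = 1/2
```

Each printed pair should agree to floating-point precision, which is what "mother to other densities" means in practice.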

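8. Beta ($\alpha > 0, \beta > 0$):

$$f_X(x; \alpha, \beta) = \begin{cases} \dfrac{(\alpha + \beta + 1)!}{\alpha!\,\beta!}\, x^{\alpha} (1 - x)^{\beta}, & 0 < x < 1, \\ 0, & \text{else.} \end{cases}$$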


The beta distribution is a two-parameter family of functions that appears in statistics. It is 
shown in Figure 2.4-7. 

There are other pdf’s of importance in engineering and science, and we shall encounter 
some of them as we continue our study of probability. They all, however, share the properties 
that 


†The Student-t distribution is so named because its discoverer W. S. Gosset (1876-1937) published his
papers under the name “Student.” Gosset, E. S. Pearson, R. A. Fisher, and J. Neyman are regarded as
the founders of modern statistics.
$$f_X(x) \ge 0 \tag{2.4-23}$$

$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1. \tag{2.4-24}$$


When Fx(z) is not continuous, strictly speaking, its finite derivative does not exist and, 
therefore, the pdf doesn’t exist. The question of what probability function is useful in 
describing X depends on the classification of X. We consider this next. 


2.5 CONTINUOUS, DISCRETE, AND MIXED RANDOM VARIABLES 


If $F_X(x)$ is continuous for every $x$ and its derivative exists everywhere except at a countable
set of points, then we say that $X$ is a continuous RV. At points $x$ where $F_X'(x)$ exists, the
pdf is $f_X(x) = F_X'(x)$. At points where $F_X(x)$ is continuous, but $F_X'(x)$ is discontinuous,
we can assign any positive number to $f_X(x)$; $f_X(x)$ will then be defined for every $x$, and
we are free to use the following important formulas:

$$F_X(x) = \int_{-\infty}^{x} f_X(\xi)\,d\xi, \tag{2.5-1}$$
$$P[x_1 < X \le x_2] = \int_{x_1}^{x_2} f_X(\xi)\,d\xi, \tag{2.5-2}$$

and

$$P[B] = \int_{\xi:\, \xi \in B} f_X(\xi)\,d\xi, \tag{2.5-3}$$


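where, in Equation 2.5-3, $B \in \mathscr{B}$, that is, $B$ is an event. Equation 2.5-3 follows from the fact
that for a continuous random variable, events can be written as a union of disjoint intervals
in $R$. Thus, for example, let $B = \{\xi: \xi \in \bigcup_{i=1}^{n} I_i,\ I_i \cap I_j = \phi \text{ for } i \ne j\}$, where $I_i = (a_i, b_i]$.
Then clearly,

$$P[B] = \int_{a_1}^{b_1} f_X(\xi)\,d\xi + \int_{a_2}^{b_2} f_X(\xi)\,d\xi + \cdots + \int_{a_n}^{b_n} f_X(\xi)\,d\xi$$
$$= \int_{\xi:\, \xi \in B} f_X(\xi)\,d\xi. \tag{2.5-4}$$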


A discrete random variable has a staircase type of distribution function (Figure 2.5-1).
A probability measure for a discrete RV is the probability mass function† (PMF). The
PMF $P_X(x)$ of a (discrete) random variable $X$ is defined as


†Like mass, probability is nonnegative and conserved. Hence the term mass in probability mass function.
Figure 2.5-1 The cumulative distribution function for a discrete random variable.


$$P_X(x) \triangleq P[X = x] = P[X \le x] - P[X < x]. \tag{2.5-5}$$

Thus, $P_X(x) = 0$ everywhere where $F_X(x)$ is continuous and has nonzero values only where
there is a discontinuity, that is, jump, in the CDF. If we denote $P[X < x]$ by $F_X(x^-)$,
then at the jumps $x_i$, $i = 1, 2, \ldots$, the finite values of $P_X(x_i)$ can be computed from
$P_X(x_i) = F_X(x_i) - F_X(x_i^-)$.

The probability mass function is used when there are at most a countable set of outcomes
of the random experiment. Indeed $P_X(x_i)$ lends itself to the following frequency interpretation:
Perform an experiment $n$ times and let $n_i$ be the number of tries that $x_i$ appears as
an outcome. Then, for $n$ large,

$$P_X(x_i) \approx \frac{n_i}{n}. \tag{2.5-6}$$


Because the PMF is so closely related to the frequency notion of probability, it is sometimes 
called the frequency function. 

Since for a discrete RV $F_X(x)$ is not continuous, $f_X(x)$, strictly speaking, does not exist.
Nevertheless, with the introduction of Dirac delta functions,† we shall be able to assign pdf's
to discrete RVs as well. The CDF for a discrete RV is given by

$$F_X(x) \triangleq P[X \le x] = \sum_{\text{all } x_i \le x} P_X(x_i) \tag{2.5-7}$$

and, more generally, for any event $B$ when $X$ is discrete:

$$P[B] = \sum_{\text{all } x_i \in B} P_X(x_i). \tag{2.5-8}$$


†Also called impulses or impulse functions. Named after the English physicist Paul A. M. Dirac (1902-
1984). Delta functions are discussed in Section B.2 of Appendix B.
Some Common Discrete Random Variables


1. Bernoulli random variable $B$ with parameter $p$ ($0 < p < 1$, $q \triangleq 1 - p$):

$$P_B(k) = \begin{cases} q, & k = 0, \\ p, & k = 1, \\ 0, & \text{else,} \end{cases} \tag{2.5-9}$$
$$= q\,\delta(k) + p\,\delta(k - 1), \quad \text{by use of discrete delta function† } \delta(k). \tag{2.5-10}$$

The Bernoulli random variable appears in those situations where the outcome is one of
two possible states, for example, whether a particular bit in a digital sequence is “one” or
“zero.” The Bernoulli PMF can be conveniently written as $P_B(k) = p^k q^{1-k}$ for $k = 0$ or 1
and then zero elsewhere. The corresponding CDF is given as

$$F_B(k) = \begin{cases} 0, & k < 0, \\ q, & 0 \le k < 1, \\ 1, & k \ge 1, \end{cases}$$
$$= q\,u(k) + p\,u(k - 1) \quad \text{by use of unit-step function } u(k).$$


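2. Binomial random variable $K$ with parameters $n$ and $p$ ($n = 1, 2, \ldots$; $0 < p < 1$)
and $k$ an integer:

$$P_K(k) = \begin{cases} \binom{n}{k} p^k q^{n-k}, & 0 \le k \le n, \\ 0, & \text{else,} \end{cases} \tag{2.5-11}$$
$$= \binom{n}{k} p^k q^{n-k} \left[u(k) - u(k - n - 1)\right]. \tag{2.5-12}$$

The binomial random variable appears in games of chance, military defense strategies,
failure analysis, and many other situations. Its corresponding CDF is given as ($l, k, n$ are
integers)

$$F_K(k) = \begin{cases} 0, & k < 0, \\ \sum_{l=0}^{k} \binom{n}{l} p^l q^{n-l}, & 0 \le k < n, \\ 1, & k \ge n. \end{cases}$$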
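3. Poisson random variable $X$ with parameter $\mu$ ($> 0$) and $k$ an integer:

$$P_X(k) = \begin{cases} \dfrac{\mu^k}{k!} e^{-\mu}, & 0 \le k < \infty, \\ 0, & \text{else.} \end{cases}$$

The Poisson law is widely used in every branch of science and engineering (see Section 1.10).
We can write the Poisson PMF in a single line by use of the unit-step function $u(k)$ as

$$P_X(k) = \frac{\mu^k}{k!} e^{-\mu} u(k),$$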


†Recall that the discrete delta function has value 1 when the argument is 0 and has value 0 for every
other value.




where the discrete unit-step function is defined by

$$u(k) \triangleq \begin{cases} 1, & 0 \le k < \infty, \\ 0, & -\infty < k < 0. \end{cases}$$


4. Geometric random variable $K$ with parameters $p > 0$, $q > 0$, ($p + q = 1$) and $k$ an
integer:

$$P_K(k) = \begin{cases} p q^k, & 0 \le k < \infty, \\ 0, & \text{else,} \end{cases}$$
$$= p q^k u(k).$$

The corresponding CDF is given by a finite sum of the geometric series (ref.
Appendix A) as

$$F_K(k) = \begin{cases} 0, & k < 0, \\ p\left(\dfrac{1 - q^{k+1}}{1 - q}\right), & 0 \le k < \infty. \end{cases}$$

This distribution† was first seen in Example 1.9-4. As there, also note the variant $p q^{n-1}$, $n \ge 1$,
also called geometric RV.





Example 2.5-1 
(CDF of Poisson RV) Calculating the CDF of a Poisson random variable proceeds as
follows. Let $X$ be a Poisson random variable with parameter $\mu$ ($> 0$). Then by definition the
PMF is $P_X(k) = \frac{\mu^k}{k!} e^{-\mu} u(k)$. Then the CDF $F_X(k) = 0$ for $k < 0$. For $k \ge 0$, we have

$$F_X(k) = \sum_{l=0}^{k} \frac{\mu^l}{l!} e^{-\mu}.$$


Table 2.5-1 lists the common discrete RVs, their PMFs, and their CDFs. 
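The discrete CDFs above are finite sums of the PMF, which makes them straightforward to compute. The following is a rough sketch rather than a definitive implementation; the function names are hypothetical, and the geometric case is evaluated both by its closed form and by direct summation so the two can be compared.

```python
import math


def poisson_pmf(k, mu):
    """Poisson PMF mu^k e^{-mu} / k! for integer k >= 0, zero otherwise."""
    if k < 0:
        return 0.0
    return math.exp(-mu) * mu ** k / math.factorial(k)


def poisson_cdf(k, mu):
    """F_X(k) as the sum of the PMF from 0 to k (zero for k < 0)."""
    if k < 0:
        return 0.0
    return sum(poisson_pmf(l, mu) for l in range(k + 1))


def geometric_cdf(k, p):
    """Closed-form CDF p * (1 - q^(k+1)) / (1 - q) of the geometric PMF p*q^k."""
    q = 1.0 - p
    if k < 0:
        return 0.0
    return p * (1.0 - q ** (k + 1)) / (1.0 - q)


if __name__ == "__main__":
    print([round(poisson_cdf(k, 2.0), 4) for k in range(6)])
    direct = sum(0.3 * 0.7 ** k for k in range(4))
    print(geometric_cdf(3, 0.3), direct)
```

Since $1 - q = p$, the geometric closed form simplifies to $1 - q^{k+1}$, which is a quick sanity check on the printed values.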

Sometimes an RV is neither purely discrete nor purely continuous. We call such an RV
a mixed RV. The CDF of a mixed RV is shown in Figure 2.5-2. Thus, $F_X(x)$ is discontinuous
but not a staircase-type function.


†Note that we sometimes speak of the probability distribution in a general sense without meaning the
distribution function per se. Here we give a PMF to illustrate the geometric distribution.
Table 2.5-1 Common Discrete RVs, PMFs, and CDFs

Family              PMF $P_K(k)$                                          CDF $F_K(k)$

Bernoulli $p, q$    $q\,\delta(k) + p\,\delta(k - 1)$                     $q\,u(k) + p\,u(k - 1)$
Binomial $n, p$     $\binom{n}{k} p^k q^{n-k}[u(k) - u(k - n - 1)]$       $0$ for $k < 0$; $\sum_{l=0}^{k} \binom{n}{l} p^l q^{n-l}$ for $0 \le k < n$; $1$ for $k \ge n$
Poisson $\mu > 0$   $\frac{\mu^k}{k!} e^{-\mu} u(k)$                      $\frac{\Gamma(k+1, \mu)}{k!}\, u(k)$†
Geometric $p, q$    $p q^k u(k)$                                          $p\left(\frac{1 - q^{k+1}}{1 - q}\right) u(k)$


Figure 2.5-2 The CDF of a mixed RV.


The distinction between continuous and discrete RVs is somewhat artificial. Continuous 
and discrete RVs are often regarded as different objects even though the only real difference 
between them is that for the former the CDF is continuous while for the latter it is not. By 
introducing delta functions we can, to a large extent, treat them in the same fashion and 
compute probabilities for both continuous and discrete RVs by integrating pdf's. 

Returning now to Equation 2.5-7, which can be written as

$$F_X(x) = \sum_{i=-\infty}^{\infty} P_X(x_i)\, u(x - x_i), \tag{2.5-14}$$

and using the results from the section on delta functions in Appendix B enables us to write
for a discrete RV

$$f_X(x) = \frac{dF_X(x)}{dx} = \sum_{i=-\infty}^{\infty} P_X(x_i)\, \delta(x - x_i). \tag{2.5-15}$$

†See Appendix B for a definition of the incomplete gamma.
Figure 2.5-3 (a) CDF of a discrete RV X; (b) pdf of X using delta functions, with impulses $0.2\delta(x)$, $0.6\delta(x - 1)$, and $0.2\delta(x - 3)$.


Example 2.5-2
(practice example) Let $X$ be a discrete RV with distribution function as shown in Figure 2.5-3(a).
The pdf of $X$ is

$$f_X(x) = \frac{dF_X(x)}{dx} = 0.2\,\delta(x) + 0.6\,\delta(x - 1) + 0.2\,\delta(x - 3)$$

and is shown in Figure 2.5-3(b). To compute probabilities from the pdf for a discrete RV,
great care must be used in choosing the interval of integration. Thus,

$$F_X(x) = \int_{-\infty}^{x^+} f_X(\xi)\,d\xi,$$

which includes the delta function at $x$ if there is one there.
Similarly $P[x_1 < X \le x_2]$ involves the interval $(x_1, x_2]$
and includes the impulse at $x_2$ (if there is one there) but excludes what happens at $x_1$. On
the other hand $P[x_1 < X < x_2]$ involves the interval $(x_1, x_2)$
and therefore

$$P[x_1 < X < x_2] = \int_{x_1^+}^{x_2^-} f_X(\xi)\,d\xi.$$

Applied to the foregoing example, these formulas give

$$P[X \le 1.5] = F_X(1.5) = 0.8 \tag{2.5-16}$$
$$P[1 < X \le 3] = 0.2 \tag{2.5-17}$$
$$P[1 \le X < 3] = 0.6. \tag{2.5-18}$$
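The endpoint bookkeeping in this example can be made explicit in code. The sketch below is an illustration only: it assumes the impulse weights 0.2, 0.6, and 0.2 at $x = 0, 1, 3$ from the example, and the helper `interval_prob` with its `include_lo` and `include_hi` flags is a hypothetical name, not notation from the text.

```python
def interval_prob(pmf, lo, hi, include_lo=False, include_hi=True):
    """Probability that a discrete RV falls between lo and hi.

    pmf maps each mass point to its probability. The two flags decide
    whether an impulse sitting exactly on an endpoint is counted.
    """
    total = 0.0
    for x, p in pmf.items():
        above = (x >= lo) if include_lo else (x > lo)
        below = (x <= hi) if include_hi else (x < hi)
        if above and below:
            total += p
    return total


if __name__ == "__main__":
    pmf = {0: 0.2, 1: 0.6, 3: 0.2}  # impulse weights of the pdf in this example
    print(interval_prob(pmf, float("-inf"), 1.5))                        # P[X <= 1.5]
    print(interval_prob(pmf, 1, 3))                                      # P[1 < X <= 3]
    print(interval_prob(pmf, 1, 3, include_lo=True, include_hi=False))   # P[1 <= X < 3]
```

The default flags correspond to the half-open interval $(x_1, x_2]$ that the CDF difference $F_X(x_2) - F_X(x_1)$ measures.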


Example 2.5-3
(Practice example) The pdf associated with the Poisson law with parameter $a$ is

$$f_X(x) = e^{-a} \sum_{k=0}^{\infty} \frac{a^k}{k!}\, \delta(x - k).$$


Example 2.5-4
(Practice example) The pdf associated with the binomial law $b(k; n, p)$ is

$$f_X(x) = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k}\, \delta(x - k).$$


Example 2.5-5
(Practice example) The pdf of a mixed RV is shown in Figure 2.5-4. (1) What is the
constant $K$? (2) Compute $P[X \le 5]$, $P[5 \le X < 10]$. (3) Draw the distribution function.


Solution (1) Since

$$\int_{-\infty}^{\infty} f_X(\xi)\,d\xi = 1,$$

we obtain $10K + 0.25 + 0.25 = 1 \Rightarrow K = 0.05$.
(2) Since $P[X \le 5] = P[X < 5] + P[X = 5]$, the impulse at $x = 5$ must be included.
Hence

$$P[X \le 5] = \int_{0}^{5^+} [0.05 + 0.25\,\delta(\xi - 5)]\,d\xi = 0.5.$$

To compute $P[5 \le X < 10]$, we leave out the impulse at $x = 10$ but include the impulse at
$x = 5$. Thus,

$$P[5 \le X < 10] = \int_{5^-}^{10^-} [0.05 + 0.25\,\delta(\xi - 5)]\,d\xi = 0.5.$$
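The same bookkeeping extends to a mixed RV: a flat density contributes length times height, and each impulse contributes its weight if it lies inside the interval. The sketch below assumes, as the term $10K$ in the solution suggests, that the flat part of the pdf sits on $0 \le x \le 10$; that support, and the names `flat_level` and `mixed_prob`, are assumptions made for illustration.

```python
IMPULSES = {5.0: 0.25, 10.0: 0.25}  # impulse location -> weight, as in this example
SUPPORT = (0.0, 10.0)                # assumed support of the flat part of the pdf


def flat_level(impulses=IMPULSES, support=SUPPORT):
    """Height K of the flat part, from the requirement that the pdf integrates to 1."""
    return (1.0 - sum(impulses.values())) / (support[1] - support[0])


def mixed_prob(lo, hi, include_lo, include_hi, impulses=IMPULSES, support=SUPPORT):
    """Probability of the interval from lo to hi for the flat-plus-impulses pdf."""
    k = flat_level(impulses, support)
    overlap = max(0.0, min(hi, support[1]) - max(lo, support[0]))
    total = k * overlap
    for x, w in impulses.items():
        inside = (lo < x < hi) or (include_lo and x == lo) or (include_hi and x == hi)
        if inside:
            total += w
    return total


if __name__ == "__main__":
    print(flat_level())                                   # the constant K
    print(mixed_prob(float("-inf"), 5.0, False, True))    # P[X <= 5]
    print(mixed_prob(5.0, 10.0, True, False))             # P[5 <= X < 10]
```

The endpoint flags play the role of the $5^+$, $5^-$, and $10^-$ limits of integration in the solution.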
Figure 2.5-4 (a) pdf of a mixed RV for Example 2.5-5, with impulses $0.25\,\delta(x - 5)$ and $0.25\,\delta(x - 10)$; (b) computed CDF.


2.6 CONDITIONAL AND JOINT DISTRIBUTIONS AND DENSITIES 


Consider the event $C$ consisting of all outcomes $\zeta \in \Omega$ such that $X(\zeta) \le x$ and $\zeta \in B \subset \Omega$,
where $B$ is another event. Then, by definition, the event $C$ is the set intersection of the two
events $\{\zeta: X(\zeta) \le x\}$ and $\{\zeta: \zeta \in B\}$. We define the conditional distribution function of $X$
given the event $B$ as

$$F_X(x|B) \triangleq \frac{P[C]}{P[B]} = \frac{P[X \le x, B]}{P[B]}, \tag{2.6-1}$$

where $P[X \le x, B]$ is the probability of the joint event $\{X \le x\} \cap B$ and $P[B] \ne 0$. If
$x = \infty$, the event $\{X \le \infty\}$ is the certain event $\Omega$ and since $\Omega \cap B = B$, $F_X(\infty|B) = 1$.
Similarly, if $x = -\infty$, $\{X \le -\infty\} = \phi$ and since $\phi \cap B = \phi$, $F_X(-\infty|B) = 0$. Continuing in
this fashion, it is not difficult to show that $F_X(x|B)$ has all the properties of an ordinary
distribution, that is, $x_1 \le x_2 \rightarrow F_X(x_1|B) \le F_X(x_2|B)$.

For example, consider the event $\{X \le x_2, B\}$ and write (assuming $x_2 \ge x_1$)

$$\{X \le x_2, B\} = \{X \le x_1, B\} \cup \{x_1 < X \le x_2, B\}.$$
Since the two events on the right are disjoint, their probabilities add and we obtain

$$P[X \le x_2, B] = P[X \le x_1, B] + P[x_1 < X \le x_2, B]$$

or

$$P[X \le x_2|B]P[B] = P[X \le x_1|B]P[B] + P[x_1 < X \le x_2|B]P[B].$$

Thus when $P[B] \ne 0$, we obtain after rearranging terms and dividing by $P[B]$

$$P[x_1 < X \le x_2|B] = P[X \le x_2|B] - P[X \le x_1|B]$$
$$= F_X(x_2|B) - F_X(x_1|B). \tag{2.6-2}$$


Generally the event $B$ will be expressed on the probability space $(R, \mathscr{B}, P_X)$ rather than
the original space $(\Omega, \mathscr{F}, P)$. The conditional pdf is simply

$$f_X(x|B) \triangleq \frac{dF_X(x|B)}{dx}. \tag{2.6-3}$$


Following are some examples. 


Example 2.6-1
(evaluating conditional CDFs) Let $B \triangleq \{X \le 10\}$. We wish to compute $F_X(x|B)$.


(i) For $x \ge 10$, the event $\{X \le 10\}$ is a subset of the event $\{X \le x\}$. Hence
$P[X \le 10, X \le x] = P[X \le 10]$ and use of Equation 2.6-1 gives

$$F_X(x|B) = \frac{P[X \le x, X \le 10]}{P[X \le 10]} = 1.$$

(ii) For $x \le 10$, the event $\{X \le x\}$ is a subset of the event $\{X \le 10\}$. Hence
$P[X \le 10, X \le x] = P[X \le x]$ and

$$F_X(x|B) = \frac{P[X \le x]}{P[X \le 10]}.$$

The result is shown in Figure 2.6-1. We leave as an exercise to the reader to compute
$F_X(x|B)$ when $B = \{b < X \le a\}$.
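The two cases of this example collapse into one expression, $F_X(\min(x, c))/F_X(c)$ for the conditioning event $\{X \le c\}$. A minimal sketch follows; the function names are hypothetical, and the choice of a Gaussian with mean 8 and standard deviation 2 is an assumption made only to have a concrete CDF to condition.

```python
import math


def conditional_cdf_given_le(x, c, cdf):
    """F_X(x | X <= c) computed from Eq. 2.6-1; requires P[X <= c] > 0."""
    p_b = cdf(c)
    if p_b <= 0.0:
        raise ValueError("conditioning event has zero probability")
    # For x >= c the joint event is {X <= c}; for x < c it is {X <= x}.
    return cdf(min(x, c)) / p_b


def example_cdf(x, mu=8.0, sigma=2.0):
    """Illustrative unconditional CDF: Gaussian with assumed mean 8 and std 2."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


if __name__ == "__main__":
    for x in (4.0, 8.0, 10.0, 12.0):
        print(x, example_cdf(x), conditional_cdf_given_le(x, 10.0, example_cdf))
```

The printed conditional values are the unconditional ones scaled up by $1/P[X \le 10]$ until they reach 1 at $x = 10$.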
Figure 2.6-1 Conditional and unconditional CDFs of X.


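Example 2.6-2
(Poisson conditioned on even) Let $X$ be a Poisson RV with parameter $\mu$ ($> 0$). We wish
to compute the conditional PMF and CDF of $X$ given the event $\{X = 0, 2, 4, \ldots\} \triangleq \{X \text{ (is) even}\}$.
First observe that $P[X \text{ even}]$ is given by

$$P[X = 0, 2, \ldots] = \sum_{k = 0, 2, \ldots}^{\infty} \frac{\mu^k}{k!} e^{-\mu}.$$

Then for $X$ odd, we have

$$P[X = 1, 3, \ldots] = \sum_{k = 1, 3, \ldots}^{\infty} \frac{\mu^k}{k!} e^{-\mu}.$$

From these relations, we obtain

$$\sum_{k \ge 0 \text{ and even}} \frac{\mu^k}{k!} e^{-\mu} - \sum_{k \ge 0 \text{ and odd}} \frac{\mu^k}{k!} e^{-\mu} = \sum_{k=0}^{\infty} \frac{\mu^k (-1)^k}{k!} e^{-\mu}$$
$$= \sum_{k=0}^{\infty} \frac{(-\mu)^k}{k!} e^{-\mu}$$
$$= e^{-\mu} e^{-\mu}$$
$$= e^{-2\mu}$$

and

$$\sum_{k \ge 0 \text{ and even}} \frac{\mu^k}{k!} e^{-\mu} + \sum_{k \ge 0 \text{ and odd}} \frac{\mu^k}{k!} e^{-\mu} = 1.$$

Hence $P[X \text{ even}] = P[X = 0, 2, \ldots] = \frac{1}{2}(1 + e^{-2\mu})$. Using the definition of conditional PMF,
we obtain

$$P_X(k|X \text{ even}) = \frac{P[X = k, X \text{ even}]}{P[X \text{ even}]}.$$

If $k$ is even, then $\{X = k\}$ is a subset of $\{X \text{ even}\}$. If $k$ is odd, $\{X = k\} \cap \{X \text{ even}\} = \phi$.
Hence $P[X = k, X \text{ even}] = P[X = k]$ for $k$ even and it equals 0 for $k$ odd. So we have

$$P_X(k|X \text{ even}) = \begin{cases} \dfrac{2\mu^k e^{-\mu}}{(1 + e^{-2\mu})\,k!}, & k \ge 0 \text{ and even}, \\ 0, & k \text{ odd}. \end{cases}$$

The conditional CDF is then

$$F_X(x|X \text{ even}) = \sum_{\text{all } k \le x} P_X(k|X \text{ even}) = \sum_{\substack{0 \le k \le x \\ \text{and even}}} \frac{2\mu^k e^{-\mu}}{(1 + e^{-2\mu})\,k!}.$$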

Let us next derive some important formulas involving conditional CDFs and pdf’s. 


The distribution function written as a weighted sum of conditional distribution functions.
Equation 1.6-7 in Chapter 1 gave the probability of the event $B$ in terms of $n$
mutually exclusive and exhaustive events $\{A_i\}$, $i = 1, \ldots, n$, defined on the same probability
space as $B$. With $B \triangleq \{X \le x\}$, we immediately obtain from Equation 1.6-7:

$$F_X(x) = \sum_{i=1}^{n} F_X(x|A_i) P[A_i]. \tag{2.6-4}$$

Equation 2.6-4 describes $F_X(x)$ as a weighted sum of conditional distribution functions.
One way to view Equation 2.6-4 is an “average” over all the conditional CDFs.† Since we
haven't yet made concrete the notion of average (this will be done in Chapter 4), we ask
only that the reader accept the nomenclature since it is in use in the technical literature.


Example 2.6-3 
(defective memory chips) In the automated manufacturing of computer memory chips, 
company Z produces one defective chip for every five good chips. The defective chips (DC) 
have a time of failure X that obeys the CDF 





$$F_X(x|DC) = (1 - e^{-x/2})\,u(x) \quad (x \text{ in months})$$

while the time of failure for the good chips (GC) obeys the CDF

$$F_X(x|GC) = (1 - e^{-x/10})\,u(x) \quad (x \text{ in months}).$$

The chips are visually indistinguishable. A chip is purchased. What is the probability that
the chip will fail before six months of use?


†For this reason, when $F_X(x)$ is written as in Equation 2.6-4, it is sometimes called the average
distribution function.

Solution The unconditional CDF for the chip is, from Equation 2.6-4,

$$F_X(x) = F_X(x|DC)P[DC] + F_X(x|GC)P[GC],$$

where $P[DC]$ and $P[GC]$ are the probabilities of selecting a defective and good chip, respectively.
From the given data $P[DC] = 1/6$ and $P[GC] = 5/6$. Thus,

$$F_X(6) = [1 - e^{-3}]\tfrac{1}{6} + [1 - e^{-0.6}]\tfrac{5}{6}$$
$$= 0.158 + 0.376 = 0.534.$$
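Equation 2.6-4 is a weighted sum, so it maps directly onto a short function. The following sketch is an illustration only: it assumes exponential failure-time CDFs with time constants of 2 months for defective chips and 10 months for good chips, with weights 1/6 and 5/6, and the names `exp_cdf` and `mixture_cdf` are hypothetical.

```python
import math


def exp_cdf(x, scale):
    """(1 - exp(-x/scale)) u(x): exponential CDF with the given time constant."""
    return 1.0 - math.exp(-x / scale) if x >= 0 else 0.0


def mixture_cdf(x, components):
    """Eq. 2.6-4: sum of P[A_i] * F_X(x | A_i) over (weight, conditional CDF) pairs."""
    return sum(weight * cdf(x) for weight, cdf in components)


# (P[A_i], F_X(. | A_i)) for defective and good chips
CHIPS = [
    (1.0 / 6.0, lambda x: exp_cdf(x, 2.0)),   # defective chips
    (5.0 / 6.0, lambda x: exp_cdf(x, 10.0)),  # good chips
]

if __name__ == "__main__":
    print(mixture_cdf(6.0, CHIPS))  # probability of failure within six months
```

Adding a third chip population would only require appending another (weight, CDF) pair, provided the weights still sum to 1.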





Bayes' formula for probability density functions. Consider the events $B$ and $\{X = x\}$
defined on the same probability space. Then from the definition of conditional probability,
it seems reasonable to write

$$P[B|X = x] = \frac{P[B, X = x]}{P[X = x]}. \tag{2.6-5}$$

The problem with Equation 2.6-5 is that if $X$ is a continuous RV, then $P[X = x] = 0$.
Hence Equation 2.6-5 is undefined. Nevertheless, we can compute $P[B|X = x]$ by taking
appropriate limits of probabilities involving the event $\{x < X \le x + \Delta x\}$. Thus, consider
the expression

$$P[B|x < X \le x + \Delta x] = \frac{P[x < X \le x + \Delta x|B]\,P[B]}{P[x < X \le x + \Delta x]}.$$

If we (i) divide numerator and denominator of the expression on the right by $\Delta x$, (ii) use
the fact that $P[x < X \le x + \Delta x|B] = F_X(x + \Delta x|B) - F_X(x|B)$, and (iii) take the limit as
$\Delta x \to 0$, we obtain

$$P[B|X = x] = \lim_{\Delta x \to 0} P[B|x < X \le x + \Delta x] = \frac{f_X(x|B)\,P[B]}{f_X(x)}, \quad f_X(x) \ne 0. \tag{2.6-6}$$

The quantity on the left is sometimes called the a posteriori probability (or a posteriori
density) of $B$ given $X = x$. Multiplying both sides of Equation 2.6-6 by $f_X(x)$ and integrating
enables us to obtain the important result

$$P[B] = \int_{-\infty}^{\infty} P[B|X = x]\, f_X(x)\,dx. \tag{2.6-7}$$


In line with the terminology used in this section, P[B] is sometimes called the average 
probability of B, the usage being suggested by the form of Equation 2.6-7. 
Figure 2.6-2 Based upon observing the signal, the receiver R must decide which switch was closed
or, equivalently, which of the sources A, B, C was responsible for the signal. Only one switch can be
closed at the time the receiver is on.








Example 2.6-4 
(detecting closed switch) A signal, X, can come from one of three different sources designated 
as A, B, or C. The signal from A is distributed as N(—1, 4); the signal from B is distributed 
as N (0, 1); and the signal from C has an N(1,4) distribution. In order for the signal to reach 
its destination at R, the switch in the line must be closed. Only one switch can be closed 
when the signal X is observed at R, but it is not known which switch it is. However, it is 
known that switch a is closed twice as often as switch b, which is closed twice as often as 
switch c (Figure 2.6-2). 





(a) Compute $P[X \le -1]$;
(b) Given that we observe the event $\{X > -1\}$, from which source was this signal most
likely?


Solution (a) Let $P[A]$ denote the probability that $A$ is responsible for the observation
at $R$, that is, switch $a$ is closed. Likewise for $P[B]$, $P[C]$. Then from the information about
the switches we get $P[A] = 2P[B] = 4P[C]$ and $P[A] + P[B] + P[C] = 1$. Hence $P[A] = 4/7$,
$P[B] = 2/7$, $P[C] = 1/7$. Next we compute $P[X \le -1]$ from

$$P[X \le -1] = P[X \le -1|A]P[A] + P[X \le -1|B]P[B] + P[X \le -1|C]P[C],$$

where

$$P[X \le -1|A] = 1/2 \tag{2.6-8}$$
$$P[X \le -1|B] = 1/2 - \operatorname{erf}(1) = 0.159 \tag{2.6-9}$$
$$P[X \le -1|C] = 1/2 - \operatorname{erf}(1) = 0.159. \tag{2.6-10}$$

Hence $P[X \le -1] = 1/2 \times 4/7 + 0.159 \times 2/7 + 0.159 \times 1/7 = 0.354$.

(b) We wish to compute $\max\{P[A|X > -1], P[B|X > -1], P[C|X > -1]\}$. To enable
this computation, we note that $P[X > -1|A] = 1 - P[X \le -1|A]$, and so on, for $B$ and $C$.
Concentrating on source $A$, and using Bayes' rule, we get

$$P[A|X > -1] = \frac{\{1 - P[X \le -1|A]\} \times P[A]}{1 - P[X \le -1]},$$

which, using the values already computed, yields $P[A|X > -1] = 0.44$.
Repeating the calculation for the other sources, we obtain

$$P[B|X > -1] = 0.372, \tag{2.6-11}$$
$$P[C|X > -1] = 0.186. \tag{2.6-12}$$


Hence, since the maximum a posteriori probability favors A, source A was the most likely 
cause of the event {X > —1}. 
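The a posteriori probabilities in this example follow from Bayes' rule applied source by source. The sketch below is illustrative only: it assumes the priors 4/7, 2/7, 1/7 and the Gaussian parameters stated in the example (with $N(\cdot, 4)$ read as standard deviation 2), and the names `normal_cdf` and `posterior_given_greater` are hypothetical.

```python
import math

PRIORS = {"A": 4.0 / 7.0, "B": 2.0 / 7.0, "C": 1.0 / 7.0}
PARAMS = {"A": (-1.0, 2.0), "B": (0.0, 1.0), "C": (1.0, 2.0)}  # (mean, standard deviation)


def normal_cdf(x, mu, sigma):
    """P[X <= x] for X: N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def posterior_given_greater(threshold):
    """A posteriori probability of each source given the event {X > threshold}."""
    likelihood = {s: 1.0 - normal_cdf(threshold, *PARAMS[s]) for s in PRIORS}
    evidence = sum(likelihood[s] * PRIORS[s] for s in PRIORS)
    return {s: likelihood[s] * PRIORS[s] / evidence for s in PRIORS}


if __name__ == "__main__":
    post = posterior_given_greater(-1.0)
    print(post)
    print(max(post, key=post.get))  # most likely source
```

Changing the threshold shows how the decision shifts: for large positive observations the prior advantage of source A competes with the likelihood of source C.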





Poisson transform. An important specific example of Equation 2.6-7 is the so-called
Poisson transform in which $B$ is the event that a random variable $Y$ takes on an integer
value $k$ from the set $\{0, 1, \ldots\}$, that is, $B \triangleq \{Y = k\}$, and $X$ is the Poisson parameter,
treated here as a random variable with pdf $f_X(x)$. The ordinary Poisson law

$$P[Y = k] = e^{-\mu} \frac{\mu^k}{k!}, \quad k \ge 0, \tag{2.6-13}$$

where $\mu$ is the average number of events in a given interval (time, distance, volume, and
so forth), treats the parameter as a constant. But in many situations the underlying
phenomenon that determines $\mu$ is itself random and $\mu$ must be viewed as a random outcome,
that is, the outcome of a random experiment. Thus, there are two elements of randomness:
the random value of $\mu$ and the random outcome $\{Y = k\}$. When $\mu$ is random it seems
appropriate to replace it by the notation of a random variable, say $X$. Thus, for any given
outcome $\{X = x\}$ the probability $P[Y = k|X = x]$ is Poisson; but the unconditional
probability of the event $\{Y = k\}$ is not necessarily Poisson. Because both the number of
events and the Poisson parameter are random, this situation is sometimes called doubly
stochastic. From Equation 2.6-7 we obtain for the unconditional PMF of $Y$

$$P_Y(k) = \int_0^{\infty} \frac{x^k}{k!} e^{-x} f_X(x)\,dx, \quad k \ge 0. \tag{2.6-14}$$
o k 


The above equation is known as the Poisson transform and can be used to obtain $f_X(x)$
if $P_Y(k)$ is obtained by experimentation. The mechanism by which $f_X(x)$ is obtained from
$P_Y(k)$ is the inverse Poisson transform. The derivation of the latter is as follows. Let

$$F(\omega) \triangleq \frac{1}{2\pi} \int_0^{\infty} e^{j\omega x} e^{-x} f_X(x)\,dx, \tag{2.6-15}$$

that is, the inverse Fourier transform of $e^{-x} f_X(x)$. Since

$$e^{j\omega x} = \sum_{k=0}^{\infty} [j\omega x]^k / k!, \tag{2.6-16}$$

we obtain

$$F(\omega) = \frac{1}{2\pi} \sum_{k=0}^{\infty} (j\omega)^k \int_0^{\infty} \frac{x^k}{k!} e^{-x} f_X(x)\,dx$$
$$= \frac{1}{2\pi} \sum_{k=0}^{\infty} (j\omega)^k P_Y(k). \tag{2.6-17}$$

Thus, $F(\omega)$ is known if $P_Y(k)$ is known. Taking the forward Fourier transform of $F(\omega)$
yields

$$e^{-x} f_X(x) = \int_{-\infty}^{\infty} F(\omega) e^{-j\omega x}\,d\omega,$$

or

$$f_X(x) = e^{x} \int_{-\infty}^{\infty} F(\omega) e^{-j\omega x}\,d\omega. \tag{2.6-18}$$


Equation 2.6-18 is the inverse relation we have been seeking. Thus to summarize: If we know 
Py (k), we can compute F (w). Knowing F (w) enables us to obtain fx (x) by a Fourier trans- 
form. We illustrate the Poisson transform with an application from optical communication 
theory. 


Example 2.6-5
(optical communications) In an optical communication system, light from the transmitter
strikes a photodetector, which generates a photocurrent consisting of valence electrons
having become conduction electrons (Figure 2.6-3).

It is known from physics that if the transmitter uses coherent laser light of constant
intensity the Poisson parameter $X$ has pdf

$$f_X(x) = \delta(x - x_0), \quad x_0 > 0, \tag{2.6-19}$$

where $x_0$, except for a constant, is the laser intensity. On the other hand, if the transmitter
uses thermal illumination, then the Poisson parameter $X$ obeys the exponential law:

$$f_X(x) = \frac{1}{\mu} e^{-x/\mu} u(x), \tag{2.6-20}$$

where $\mu > 0$ is now just a parameter, but one that will later be shown to be the true mean
value of $X$. Compute the PMF for the electron-count variable $Y$.


Solution For coherent laser illumination we obtain from Equation 2.6-14

$$P_Y(k) = \int_0^{\infty} \frac{x^k}{k!} e^{-x}\, \delta(x - x_0)\,dx \tag{2.6-21}$$
$$= \frac{x_0^k}{k!} e^{-x_0}. \tag{2.6-22}$$

Thus, for coherent laser illumination, the photoelectrons obey the Poisson law. For thermal
illumination, we obtain

$$P_Y(k) = \int_0^{\infty} \frac{x^k}{k!} e^{-x}\, \frac{1}{\mu} e^{-x/\mu}\,dx$$
$$= \frac{1}{\mu k!} \int_0^{\infty} x^k e^{-x/\alpha}\,dx, \quad \text{with } \alpha \triangleq \frac{\mu}{\mu + 1},$$
$$= \frac{\alpha^{k+1}}{\mu k!} \int_0^{\infty} z^k e^{-z}\,dz, \quad \text{with } z \triangleq x/\alpha,$$
$$= \frac{\alpha^{k+1}}{\mu k!}\, \Gamma(k + 1), \quad \text{where } \Gamma \text{ denotes the Gamma function (see Appendix B)},$$
$$= \frac{\alpha^{k+1}}{\mu k!}\, k!$$
$$= \frac{\alpha^{k+1}}{\mu}$$
$$= \frac{\mu^k}{(1 + \mu)^{k+1}}. \tag{2.6-23}$$

This PMF law is known as the geometric distribution and is sometimes called Bose-Einstein
statistics [2-4]. It obeys the interesting recurrence relation

$$P_Y(k + 1) = \frac{\mu}{1 + \mu}\, P_Y(k). \tag{2.6-24}$$


Depending on which illumination applies, the statistics of the photocurrents are widely 
dissimilar. 
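The thermal-illumination result can be checked by evaluating the Poisson transform integral numerically. The sketch below is a rough check, not a definitive implementation: it assumes the transform in the form $\int_0^\infty (x^k/k!)\,e^{-x} f_X(x)\,dx$, the exponential pdf $(1/\mu)e^{-x/\mu}$, and the closed form $\mu^k/(1+\mu)^{k+1}$. The function names, the truncation of the integral at `upper`, and the midpoint rule are all choices made for illustration.

```python
import math


def poisson_transform(f_x, k, upper=60.0, steps=60_000):
    """Midpoint-rule approximation of the integral of x^k e^{-x} f_X(x) / k! over [0, upper]."""
    dx = upper / steps
    k_fact = math.factorial(k)
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        total += x ** k * math.exp(-x) / k_fact * f_x(x) * dx
    return total


def geometric_pmf(k, mu):
    """Closed-form PMF mu^k / (1 + mu)^(k + 1) for thermal illumination."""
    return mu ** k / (1.0 + mu) ** (k + 1)


if __name__ == "__main__":
    mu = 2.0
    thermal = lambda x: math.exp(-x / mu) / mu  # exponential pdf of the Poisson parameter
    for k in range(5):
        print(k, poisson_transform(thermal, k), geometric_pmf(k, mu))
```

The upper limit of 60 is adequate here because the integrand decays like $e^{-x(1 + 1/\mu)}$; a heavier-tailed $f_X$ would need a larger range or a different quadrature.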





Joint distributions and densities. As stated in Section 2.1, it is possible to define more than one random variable on a probability space. For example, consider a probability space $(\Omega, \mathscr{F}, P)$ involving an underlying experiment consisting of the simultaneous throwing of two fair coins. Here the ordering is not important and the only elementary outcomes are $\zeta_1 = \text{HH}$, $\zeta_2 = \text{HT}$, $\zeta_3 = \text{TT}$; the sample space is $\Omega = \{\text{HH}, \text{HT}, \text{TT}\}$; and the $\sigma$-field of events is $\phi$, $\Omega$, {HT}, {TT}, {HH}, {TT or HT}, {HH or HT}, and {HH or TT}. The probabilities are easily computed and are, respectively, 0, 1, 1/2, 1/4, 1/4, 3/4, 3/4, and 1/2. Now define two random variables

$$X_1(\zeta) = \begin{cases} 0, & \text{if at least one H} \\ 1, & \text{otherwise} \end{cases} \tag{2.6-25}$$

$$X_2(\zeta) = \begin{cases} -1, & \text{if one H and one T} \\ 1, & \text{otherwise.} \end{cases} \tag{2.6-26}$$

Then $P[X_1 = 0] = 3/4$, $P[X_1 = 1] = 1/4$, $P[X_2 = -1] = 1/2$, $P[X_2 = 1] = 1/2$. Also we can easily compute the probability of joint events, for example, $P[X_1 = 0, X_2 = 1] = P[\{\text{HH}\}] = 1/4$.

In defining more than one random variable on a probability space, it is possible to define degenerate random variables. For example, suppose the underlying experiment consists of observing the number $\zeta$ that is pointed to when a spinning wheel, numbered 0 to 100, comes to rest. Suppose we let $X_1(\zeta) = \zeta$ and $X_2(\zeta) = e^{\zeta}$. This situation is degenerate because observing one random variable completely specifies the other. In effect the uncertainty is associated with only one random variable, not both; we might as well forget about observing the other one. If we define more than one random variable on a probability space, degeneracy can be avoided if the underlying experiment is complex enough, or rich enough in outcomes. In the example we considered at the beginning, observing that $X_1 = 0$ doesn't specify the value of $X_2$, while observing $X_2 = 1$ doesn't specify the value of $X_1$.
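
The two-coin example can be tabulated mechanically. The following is a minimal sketch, assuming the three unordered outcomes and the definitions of $X_1$ and $X_2$ given in Equations 2.6-25 and 2.6-26; the helper names are hypothetical.

```python
from fractions import Fraction

# Unordered outcomes of throwing two fair coins, with their probabilities.
prob = {"HH": Fraction(1, 4), "HT": Fraction(1, 2), "TT": Fraction(1, 4)}

def x1(outcome):
    """X1 = 0 if at least one head shows, 1 otherwise."""
    return 0 if "H" in outcome else 1

def x2(outcome):
    """X2 = -1 for one head and one tail, 1 otherwise."""
    return -1 if outcome == "HT" else 1

def joint_pmf(prob, f, g):
    """Accumulate P[f = a, g = b] over all outcomes."""
    table = {}
    for outcome, p in prob.items():
        key = (f(outcome), g(outcome))
        table[key] = table.get(key, Fraction(0)) + p
    return table

joint = joint_pmf(prob, x1, x2)
p_x1_zero = sum(p for (a, _), p in joint.items() if a == 0)
```

Here `joint[(0, 1)]` is the probability of {HH}. Knowing that $X_1 = 0$ leaves two possible values of $X_2$, which is the non-degeneracy noted in the text.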


The event $\{X \le x, Y \le y\} \triangleq \{X \le x\} \cap \{Y \le y\}$ consists of all outcomes $\zeta \in \Omega$ such that $X(\zeta) \le x$ and $Y(\zeta) \le y$. The point set induced by the event $\{X \le x, Y \le y\}$ is the shaded region in the $x'y'$ plane shown in Figure 2.6-4. In the diagram the numbers $x$, $y$ are shown positive. In general they can have any value. The joint cumulative distribution function of $X$ and $Y$ is defined by

$$F_{XY}(x,y) = P[X \le x, Y \le y]. \tag{2.6-27}$$

By definition $F_{XY}(x,y)$ is a probability; thus it follows that $F_{XY}(x,y) \ge 0$ for all $x$, $y$. Since $\{X \le \infty, Y \le \infty\}$ is the certain event, $F_{XY}(\infty,\infty) = 1$. The point set associated with the certain event is the whole $x'y'$ plane. The event $\{X \le -\infty, Y \le -\infty\}$ is the impossible event and therefore $F_{XY}(-\infty,-\infty) = 0$. The reader should consider the events $\{X \le x, Y \le -\infty\}$ and $\{X \le -\infty, Y \le y\}$; are they impossible events also?

Figure 2.6-4 Point set associated with the event $\{X \le x, Y \le y\}$.

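Since $\{X \le \infty\}$ and $\{Y \le \infty\}$ are certain events, and for any event $B$, $B \cap \Omega = B$, we obtain

\begin{align}
\{X \le x, Y \le \infty\} &= \{X \le x\} \cap \{Y \le \infty\} \notag\\
&= \{X \le x\} \cap \Omega \notag\\
&= \{X \le x\} \tag{2.6-28}
\end{align}

so that

\begin{align}
F_{XY}(x,\infty) &= F_X(x) \tag{2.6-29a}\\
F_{XY}(\infty,y) &= F_Y(y). \tag{2.6-29b}
\end{align}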


If $F_{XY}(x,y)$ is continuous and differentiable, the joint pdf can be obtained from

$$f_{XY}(x,y) = \frac{\partial^2}{\partial x\, \partial y}\left[F_{XY}(x,y)\right]. \tag{2.6-30}$$

It follows then that

$$f_{XY}(x,y)\,dx\,dy = P[x < X \le x+dx,\ y < Y \le y+dy]$$

and hence that $f_{XY}(x,y) \ge 0$ for all $(x,y)$. By twice integrating Equation 2.6-30, we obtain

$$F_{XY}(x,y) = \int_{-\infty}^{x} d\xi \int_{-\infty}^{y} d\eta\, f_{XY}(\xi,\eta). \tag{2.6-31}$$

Equation 2.6-31 says that $F_{XY}(x,y)$ is the integral of the nonnegative function $f_{XY}(x,y)$ over the surface shown in Figure 2.6-4. It follows that integrating $f_{XY}(x,y)$ over a larger surface will generally yield a larger probability (never a smaller one!) than integrating over a smaller surface. From this we can deduce some obvious but important results. Thus, if $(x_1,y_1)$ and $(x_2,y_2)$ denote two pairs of numbers and if $x_1 \le x_2$, $y_1 \le y_2$, then $F_{XY}(x_1,y_1) \le F_{XY}(x_2,y_2)$. In general, $F_{XY}(x,y)$ increases as $(x,y)$ moves up and to the right and decreases as $(x,y)$ moves down and to the left. Also $F_{XY}$ is continuous from above and from the right, that is, at a point of discontinuity, say $(x_0,y_0)$, with $\varepsilon, \delta > 0$:

$$F_{XY}(x_0,y_0) = \lim_{\substack{\varepsilon \to 0 \\ \delta \to 0}} F_{XY}(x_0+\varepsilon,\ y_0+\delta).$$

Thus, at a point of discontinuity, $F_{XY}$ assumes the value immediately to the right and above the point.


Properties of joint CDF $F_{XY}(x,y)$

(i) $F_{XY}(\infty,\infty) = 1$; $F_{XY}(-\infty,y) = F_{XY}(x,-\infty) = 0$; also $F_{XY}(x,\infty) = F_X(x)$; $F_{XY}(\infty,y) = F_Y(y)$.

(ii) If $x_1 \le x_2$, $y_1 \le y_2$, then $F_{XY}(x_1,y_1) \le F_{XY}(x_2,y_2)$.

(iii) $F_{XY}(x,y) = \lim_{\varepsilon \to 0,\ \delta \to 0} F_{XY}(x+\varepsilon, y+\delta)$, $\varepsilon, \delta > 0$ (continuity from the right and from above).

(iv) For all $x_2 \ge x_1$ and $y_2 \ge y_1$, we must have

$$F_{XY}(x_2,y_2) - F_{XY}(x_2,y_1) - F_{XY}(x_1,y_2) + F_{XY}(x_1,y_1) \ge 0.$$


This last and key property (iv) is a two-dimensional generalization of the nondecreasing property for one-dimensional CDFs, that is, $F_X(x_2) - F_X(x_1) \ge 0$ for all $x_2 \ge x_1$. It arises out of the need for the event $\{x_1 < X \le x_2,\ y_1 < Y \le y_2\}$ to have nonnegative probability. The point set induced by this event is shown in Figure 2.6-5.

Figure 2.6-5 Point set for the event $\{x_1 < X \le x_2,\ y_1 < Y \le y_2\}$.

The key to this computation is to observe that the set $\{X \le x_2, Y \le y_2\}$ lends itself to the following decomposition into disjoint sets:

\begin{align}
\{X \le x_2, Y \le y_2\} = {} & \{x_1 < X \le x_2,\ y_1 < Y \le y_2\} \notag\\
& \cup \{x_1 < X \le x_2,\ Y \le y_1\} \cup \{X \le x_1,\ y_1 < Y \le y_2\} \notag\\
& \cup \{X \le x_1,\ Y \le y_1\}. \tag{2.6-32}
\end{align}

Now using the induced result from Axiom 3 (Equation 1.5-3), we obtain

\begin{align}
F_{XY}(x_2,y_2) = {} & P[x_1 < X \le x_2,\ y_1 < Y \le y_2] \notag\\
& + P[x_1 < X \le x_2,\ Y \le y_1] + P[X \le x_1,\ y_1 < Y \le y_2] \notag\\
& + F_{XY}(x_1,y_1). \tag{2.6-33}
\end{align}


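According to the elementary properties of the definite integral, the second and third terms on the right-hand side of Equation 2.6-33 can be written, respectively, as

\begin{align}
\int_{x_1}^{x_2} d\xi \int_{-\infty}^{y_1} d\eta\, f_{XY}(\xi,\eta) &= \int_{-\infty}^{x_2} d\xi \int_{-\infty}^{y_1} d\eta\, f_{XY}(\xi,\eta) - \int_{-\infty}^{x_1} d\xi \int_{-\infty}^{y_1} d\eta\, f_{XY}(\xi,\eta) \tag{2.6-34}\\
\int_{-\infty}^{x_1} d\xi \int_{y_1}^{y_2} d\eta\, f_{XY}(\xi,\eta) &= \int_{-\infty}^{x_1} d\xi \int_{-\infty}^{y_2} d\eta\, f_{XY}(\xi,\eta) - \int_{-\infty}^{x_1} d\xi \int_{-\infty}^{y_1} d\eta\, f_{XY}(\xi,\eta). \tag{2.6-35}
\end{align}

But the terms on the right-hand sides of these equations are all distributions; thus, Equations 2.6-34 and 2.6-35 become

\begin{align}
\int_{x_1}^{x_2} d\xi \int_{-\infty}^{y_1} d\eta\, f_{XY}(\xi,\eta) &= F_{XY}(x_2,y_1) - F_{XY}(x_1,y_1), \tag{2.6-36}\\
\int_{-\infty}^{x_1} d\xi \int_{y_1}^{y_2} d\eta\, f_{XY}(\xi,\eta) &= F_{XY}(x_1,y_2) - F_{XY}(x_1,y_1). \tag{2.6-37}
\end{align}

Now going back to Equation 2.6-33 and using Equations 2.6-36 and 2.6-37 we find that

\begin{align}
F_{XY}(x_2,y_2) = {} & P[x_1 < X \le x_2,\ y_1 < Y \le y_2] \notag\\
& + F_{XY}(x_2,y_1) - F_{XY}(x_1,y_1) + F_{XY}(x_1,y_2) - F_{XY}(x_1,y_1) \notag\\
& + F_{XY}(x_1,y_1). \tag{2.6-38}
\end{align}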


After simplifying and rearranging terms so that the desired quantity appears on the left-hand side, we finally get

$$P[x_1 < X \le x_2,\ y_1 < Y \le y_2] = F_{XY}(x_2,y_2) - F_{XY}(x_2,y_1) - F_{XY}(x_1,y_2) + F_{XY}(x_1,y_1). \tag{2.6-39}$$

Equation 2.6-39 is generally true for any random variables $X$, $Y$, independent or not. Some caution must be taken in applying Equation 2.6-39. For example, Figure 2.6-6(a) and (b) show two regions $A$, $B$ involving excursions on random variables $X$, $Y$ such that $\{x_1 < X \le x_2\}$ and $\{y_1 < Y \le y_2\}$. However, the use of Equation 2.6-39 would not be appropriate here since neither region is a rectangle with sides parallel to the axes. In the case of the event shown in Figure 2.6-6(a), a rotational coordinate transformation might save the day, but this would involve some knowledge of transformation of random variables, a subject covered in the next chapter. The probabilities of the events whose point sets are shown in Figure 2.6-6 can still be computed by integration of the probability density function (pdf), provided that the integration is done over the appropriate region. We illustrate with the following example.


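Example 2.6-6
(probabilities for nonrectangular sets) We are given $f_{XY}(x,y) = e^{-(x+y)}u(x)u(y)$ and wish to compute $P[(X,Y) \in \mathcal{A}]$, where $\mathcal{A}$ is the shaded region shown in Figure 2.6-7. The region $\mathcal{A}$ is described by $\mathcal{A} = \{(x,y): 0 \le x \le 1,\ |y| \le x\}$. We obtain

\begin{align}
P[(X,Y) \in \mathcal{A}] &= \int_{x=0}^{x=1} \int_{y=-x}^{y=x} e^{-(x+y)} u(x) u(y)\, dy\, dx \notag\\
&= \int_0^1 \left( \int_{-x}^{x} e^{-y} u(y)\, dy \right) e^{-x} u(x)\, dx \notag\\
&= \int_0^1 \left( \int_0^{x} e^{-y}\, dy \right) e^{-x}\, dx \notag\\
&= \int_0^1 \left[ -e^{-y} \right]_0^{x} e^{-x}\, dx \notag\\
&= \int_0^1 (1 - e^{-x})\, e^{-x}\, dx \notag\\
&= \int_0^1 (e^{-x} - e^{-2x})\, dx \notag\\
&= 1 - e^{-1} - \tfrac{1}{2} + \tfrac{1}{2} e^{-2} \notag\\
&= \tfrac{1}{2} - e^{-1} + \tfrac{1}{2} e^{-2} \notag\\
&= 0.1998. \tag{2.6-40}
\end{align}

Figure 2.6-6 Point sets of events A and B whose probabilities are not given by Equation 2.6-39: (a) point set for A; (b) point set for B.

Figure 2.6-7 The region $\mathcal{A}$ for Example 2.6-6.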
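
The value in Equation 2.6-40 can be checked numerically. This is a minimal sketch using a midpoint rule over the wedge $0 \le x \le 1$, $|y| \le x$; the grid size `n` and the function name are arbitrary choices made here, not taken from the text.

```python
import math

def region_probability(n=400):
    """Midpoint-rule integral of e^{-(x+y)} u(x) u(y) over 0 <= x <= 1, |y| <= x."""
    hx = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * hx
        hy = 2.0 * x / n              # y runs over [-x, x] for this x
        for j in range(n):
            y = -x + (j + 0.5) * hy
            if y >= 0.0:              # u(y) removes the lower half of the wedge
                total += math.exp(-(x + y)) * hx * hy
    return total

closed_form = 0.5 - math.exp(-1.0) + 0.5 * math.exp(-2.0)
estimate = region_probability()
```

Because the pdf vanishes for $y < 0$, only the upper half of the wedge contributes, which is why the inner integral in the example runs from 0 to $x$.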


Example 2.6-7
(computing CDF) Let $X$, $Y$ be two random variables with joint pdf $f_{XY}(x,y) = 1$ for $0 < x \le 1$, $0 < y \le 1$, and zero elsewhere. The support for the pdf is shown in gray; the support for the event $(-\infty, x] \times (-\infty, y]$ for values $0 < x \le 1$, $0 < y \le 1$ is shown bounded by the heavy black line.

(Figure panels: the case $0 < x \le 1$, $0 < y \le 1$ and the case $0 < x \le 1$, $y > 1$.)

For the situation shown in the first panel, $F_{XY}(x,y) = \int_0^y \int_0^x 1\, dx'\, dy' = xy$.

When $0 < x \le 1$, $y > 1$, we obtain $F_{XY}(x,y) = \int_0^x dx' \int_0^1 dy' = x$. Proceeding in this way, we eventually obtain a complete characterization of the CDF as

$$F_{XY}(x,y) = \begin{cases}
0, & x \le 0 \text{ or } y \le 0, \\
xy, & 0 < x \le 1,\ 0 < y \le 1, \\
x, & 0 < x \le 1,\ y > 1, \\
y, & x > 1,\ 0 < y \le 1, \\
1, & x > 1,\ y > 1.
\end{cases}$$
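
The piecewise CDF of Example 2.6-7 collapses to a product of clamped coordinates, which makes it easy to exercise the rectangle formula of Equation 2.6-39. The sketch below is illustrative only; `clamp01`, `cdf_uniform_square` and `rect_prob` are names introduced here.

```python
def clamp01(v):
    """Clip v to the interval [0, 1]."""
    return min(max(v, 0.0), 1.0)

def cdf_uniform_square(x, y):
    """Joint CDF of Example 2.6-7: all five cases reduce to clamp(x) * clamp(y)."""
    return clamp01(x) * clamp01(y)

def rect_prob(cdf, x1, x2, y1, y2):
    """P[x1 < X <= x2, y1 < Y <= y2] computed from Equation 2.6-39."""
    return cdf(x2, y2) - cdf(x2, y1) - cdf(x1, y2) + cdf(x1, y1)

p_inside = rect_prob(cdf_uniform_square, 0.2, 0.7, 0.1, 0.5)   # area 0.5 * 0.4
p_overlap = rect_prob(cdf_uniform_square, 0.5, 2.0, -1.0, 0.5) # only the part inside the square counts
```

For a uniform pdf the rectangle probability is just the area of the rectangle's overlap with the unit square, so both results can be verified by inspection.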





As Examples 2.6-6 and 2.6-7 illustrate for specific cases, the probability of any event of the form $\{(X,Y) \in \mathcal{A}\}$ can be computed by the formula

$$P[(X,Y) \in \mathcal{A}] = \iint_{\mathcal{A}} f_{XY}(x,y)\, dx\, dy \tag{2.6-41}$$

provided $f_{XY}(x,y)$ exists. While Equation 2.6-41 seems entirely reasonable, its veracity requires demonstration. One way to do this is to decompose the arbitrarily shaped region into a (possibly very large) number of tiny disjoint rectangular regions $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_N$. Then the event $\{(X,Y) \in \mathcal{A}\}$ is decomposed as

$$\{(X,Y) \in \mathcal{A}\} = \bigcup_{i=1}^{N} \{(X,Y) \in \mathcal{A}_i\}$$

with the consequence that (by induced Axiom 3)

$$P[(X,Y) \in \mathcal{A}] = \sum_{i=1}^{N} P[(X,Y) \in \mathcal{A}_i]. \tag{2.6-42}$$

But the probabilities on the right-hand side can be expressed in terms of distributions and hence in terms of integrals of densities (Equation 2.6-39). Then, taking the limit as $N$ becomes large and the $\mathcal{A}_i$ become infinitesimal, we would obtain Equation 2.6-41.


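The functions $F_X(x)$ and $F_Y(y)$ are called marginal distributions if they are derived from a joint distribution. Thus,

\begin{align}
F_X(x) &= F_{XY}(x,\infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{XY}(\xi,y)\, dy\, d\xi \tag{2.6-43}\\
F_Y(y) &= F_{XY}(\infty,y) = \int_{-\infty}^{y} \int_{-\infty}^{\infty} f_{XY}(x,\eta)\, dx\, d\eta. \tag{2.6-44}
\end{align}

Since the marginal densities are given by

\begin{align}
f_X(x) &= \frac{dF_X(x)}{dx} \tag{2.6-45}\\
f_Y(y) &= \frac{dF_Y(y)}{dy}, \tag{2.6-46}
\end{align}

we obtain the following by partial differentiation of Equation 2.6-43 with respect to $x$ and of Equation 2.6-44 with respect to $y$:

\begin{align}
f_X(x) &= \int_{-\infty}^{\infty} f_{XY}(x,y)\, dy \tag{2.6-47}\\
f_Y(y) &= \int_{-\infty}^{\infty} f_{XY}(x,y)\, dx. \tag{2.6-48}
\end{align}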
We next summarize the key properties of the joint pdf $f_{XY}(x,y)$.

Properties of Joint pdf's.

(i) $f_{XY}(x,y) \ge 0$ for all $x$, $y$.

(ii) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x,y)\, dx\, dy = 1$ (the certain event).

(iii) While $f_{XY}(x,y)$ is not a probability, indeed it can be greater than 1, we can regard $f_{XY}(x,y)\, dx\, dy$ as a differential probability. We will sometimes write $f_{XY}(x,y)\, dx\, dy = P[x < X \le x+dx,\ y < Y \le y+dy]$.

(iv) $f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y)\, dy$ and $f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x,y)\, dx$.

Property (i) follows from the fact that the integral of the joint pdf over any region of the plane must be nonnegative. Also, considering this joint pdf as the mixed partial derivative of the CDF, property (i) easily follows from a limiting operation applied to property (iv) of the joint CDF. Property (ii) follows from the fact that the integral of the joint pdf over the whole plane gives us the probability that the random variables will take on some value, which is the certain event with probability 1.

For discrete random variables we obtain similar results. Given the joint PMF $P_{XY}(x_i, y_k)$ for all $x_i$, $y_k$, we compute the marginal PMFs from

\begin{align}
P_X(x_i) &= \sum_{\text{all } y_k} P_{XY}(x_i, y_k) \tag{2.6-49}\\
P_Y(y_k) &= \sum_{\text{all } x_i} P_{XY}(x_i, y_k). \tag{2.6-50}
\end{align}


Example 2.6-8
(waiting time at a restaurant) A certain restaurant has been found to have the following joint distribution for the waiting time for service for a newly arriving customer and the total number of customers including the new arrival. Let $W$ be a random variable representing the continuous waiting time for a newly arriving customer, and let $N$ be a discrete random variable representing the total number of customers.

The joint distribution function is then given as

$$F_{W,N}(w,n) = \begin{cases}
0, & n < 0 \text{ or } w < 0, \\
(1 - e^{-w/\mu_0})\, \frac{n}{10}, & 0 \le n \le 5,\ w \ge 0, \\
(1 - e^{-w/\mu_0})\, \frac{5}{10} + (1 - e^{-w/\mu_1})\, \frac{n-5}{10}, & 5 < n \le 10,\ w \ge 0, \\
(1 - e^{-w/\mu_0})\, \frac{5}{10} + (1 - e^{-w/\mu_1})\, \frac{5}{10}, & 10 < n,\ w \ge 0,
\end{cases}$$

where the parameters $\mu_i$ satisfy $0 < \mu_0 < \mu_1$. Note that this choice of the parameters means that waiting times are longer when the number of customers is large.

Noting that $W$ is continuous and $N$ is discrete, we sketch this joint distribution as a function of $w$ for several values of $n$, for $n \ge 0$ and $w \ge 0$, in Figure 2.6-8.

Figure 2.6-8 The CDF of Example 2.6-8: (top) number of customers in the range 1 to 5; (bottom) number of customers in the range 6 to 10.

We next find the joint mixed probability density-mass function

\begin{align}
f_{W,N}(w,n) &\triangleq \frac{\partial}{\partial w} \nabla_n F_{W,N}(w,n) \notag\\
&= \frac{\partial}{\partial w} \left\{ F_{W,N}(w,n) - F_{W,N}(w,n-1) \right\} \notag\\
&= \frac{\partial}{\partial w} F_{W,N}(w,n) - \frac{\partial}{\partial w} F_{W,N}(w,n-1). \notag
\end{align}

In words, $f_{W,N}(w,n)$ is the pdf of $W$ together or jointly with $\{N = n\}$. Calculating, we obtain

$$\nabla_n F_{W,N}(w,n) = F_{W,N}(w,n) - F_{W,N}(w,n-1) = u(w) \begin{cases}
(1 - e^{-w/\mu_0})\, \frac{1}{10}, & 0 < n \le 5, \\
(1 - e^{-w/\mu_1})\, \frac{1}{10}, & 5 < n \le 10, \\
0, & \text{else.}
\end{cases}$$

Therefore,

$$f_{W,N}(w,n) = \frac{\partial}{\partial w} \nabla_n F_{W,N}(w,n) = u(w) \begin{cases}
\frac{1}{10} \frac{1}{\mu_0}\, e^{-w/\mu_0}, & 0 < n \le 5, \\
\frac{1}{10} \frac{1}{\mu_1}\, e^{-w/\mu_1}, & 5 < n \le 10, \\
0, & \text{else.}
\end{cases}$$

Thus we see a simpler view in terms of the joint pdf, where the shorter average waiting time $\mu_0$ governs the RV $W$ when there are five or fewer customers ($n \le 5$), while the longer average waiting time $\mu_1$ governs when there are more than 5 customers. In a more detailed model, the average waiting time would be expected to increase with each increase in $n$.





Independent random variables. Two RVs $X$ and $Y$ are said to be independent if the events $\{X \le x\}$ and $\{Y \le y\}$ are independent for every combination of $x$, $y$. In Section 1.5 two events $A$ and $B$ were said to be independent if $P[AB] = P[A]P[B]$. Taking $AB \triangleq \{X \le x\} \cap \{Y \le y\}$, where $A \triangleq \{X \le x\}$, $B \triangleq \{Y \le y\}$, and recalling that $F_X(x) \triangleq P[X \le x]$, and so forth for $F_Y(y)$, it then follows immediately that

$$F_{XY}(x,y) = F_X(x) F_Y(y) \tag{2.6-51}$$

for every $x$, $y$ if and only if $X$ and $Y$ are independent. Also

\begin{align}
f_{XY}(x,y) &= \frac{\partial^2 F_{XY}(x,y)}{\partial x\, \partial y} \tag{2.6-52}\\
&= \frac{\partial F_X(x)}{\partial x} \cdot \frac{\partial F_Y(y)}{\partial y} \notag\\
&= f_X(x) f_Y(y). \tag{2.6-53}
\end{align}

From the definition of conditional probability we obtain for independent $X$, $Y$:

\begin{align}
F_X(x \mid Y \le y) &= \frac{F_{XY}(x,y)}{F_Y(y)} \notag\\
&= F_X(x), \tag{2.6-54}
\end{align}

and so forth, for $F_Y(y \mid X \le x)$. From these results it follows (by differentiation) that for independent RVs the conditional pdf's are equal to the marginal pdf's, that is,

\begin{align}
f_X(x \mid Y \le y) &= f_X(x) \tag{2.6-55}\\
f_Y(y \mid X \le x) &= f_Y(y). \tag{2.6-56}
\end{align}

It is easy to show from Equation 2.6-39 that the events $\{x_1 < X \le x_2\}$ and $\{y_1 < Y \le y_2\}$ are independent if $X$ and $Y$ are independent random variables, that is,

$$P[x_1 < X \le x_2,\ y_1 < Y \le y_2] = P[x_1 < X \le x_2]\, P[y_1 < Y \le y_2] \tag{2.6-57}$$

if $F_{XY}(x,y) = F_X(x) F_Y(y)$. Indeed, using Equation 2.6-39,

\begin{align}
P[x_1 < X \le x_2,\ y_1 < Y \le y_2] &= F_{XY}(x_2,y_2) - F_{XY}(x_2,y_1) - F_{XY}(x_1,y_2) + F_{XY}(x_1,y_1) \tag{2.6-58}\\
&= F_X(x_2)F_Y(y_2) - F_X(x_2)F_Y(y_1) - F_X(x_1)F_Y(y_2) + F_X(x_1)F_Y(y_1) \tag{2.6-59}\\
&= (F_X(x_2) - F_X(x_1))(F_Y(y_2) - F_Y(y_1)) \tag{2.6-60}\\
&= P[x_1 < X \le x_2]\, P[y_1 < Y \le y_2]. \tag{2.6-61}
\end{align}


Example 2.6-9
The experiment consists of throwing a fair die once. The sample space for the experiment is $\Omega = \{1, 2, 3, 4, 5, 6\}$. We define two RVs as follows:

$$X(\zeta) \triangleq \begin{cases} 1 + \zeta, & \text{for outcomes } \zeta = 1 \text{ or } 3 \\ 0, & \text{for all other values of } \zeta \end{cases}$$

$$Y(\zeta) \triangleq \begin{cases} 1 - \zeta, & \text{for outcomes } \zeta = 1, 2, \text{ or } 3 \\ 0, & \text{for all other values of } \zeta \end{cases}$$

(a) Compute the relevant single and joint PMFs.
(b) Compute the joint CDF values $F_{XY}(1,1)$, $F_{XY}(3,-0.5)$, $F_{XY}(5,-1.5)$.
(c) Are the RVs $X$ and $Y$ independent?

Solution Since the die is assumed fair, each face has a probability of 1/6 of showing up.
(a) So the singleton events $\{\zeta\}$ are all equally likely, with probability $P[\{\zeta\}] = 1/6$. Thus, we obtain $X(1) = 2$, $X(3) = 4$, and for the other outcomes, we have

$$X(2) = X(4) = X(5) = X(6) = 0.$$

Thus, the PMF $P_X$ is given as $P_X(0) = 4/6$, $P_X(2) = 1/6$, $P_X(4) = 1/6$, and $P_X(k) = 0$ for all other $k$.
Likewise, from the definition of $Y(\zeta)$, we obtain

$$Y(1) = Y(4) = Y(5) = Y(6) = 0, \quad Y(2) = -1, \text{ and } Y(3) = -2,$$

thus yielding PMF values $P_Y(0) = 4/6$, $P_Y(-1) = 1/6$, $P_Y(-2) = 1/6$, and $P_Y(k) = 0$ for all other $k$.

We next compute the joint PMF $P_{XY}(i,j)$ directly from the definition, that is, $P_{XY}(i,j) = P[\text{all } \zeta : X(\zeta) = i, Y(\zeta) = j]$. This is easily done if we recall that joint probabilities are probabilities of intersections of subsets of $\Omega$ and, for example, the event of observing the die faces of 2, 4, 5, or 6 is written as the subset $\{2,4,5,6\}$. Thus, $P_{XY}(0,0) = P[\text{all } \zeta : X(\zeta) = 0, Y(\zeta) = 0] = P[\{2,4,5,6\} \cap \{1,4,5,6\}] = P[\{4,5,6\}] = 1/2$.
Likewise we compute:

\begin{align}
P_{XY}(2,0) &= P[\{1\} \cap \{1,4,5,6\}] = P[\{1\}] = 1/6 \notag\\
P_{XY}(4,0) &= P[\{3\} \cap \{1,4,5,6\}] = P[\phi] = 0 \notag\\
P_{XY}(0,-1) &= P[\{2,4,5,6\} \cap \{2\}] = P[\{2\}] = 1/6 \notag\\
P_{XY}(2,-1) &= P[\{1\} \cap \{2\}] = P[\phi] = 0 \notag\\
P_{XY}(4,-1) &= P[\{3\} \cap \{2\}] = P[\phi] = 0 \notag\\
P_{XY}(0,-2) &= P[\{2,4,5,6\} \cap \{3\}] = P[\phi] = 0 \notag\\
P_{XY}(2,-2) &= P[\{1\} \cap \{3\}] = P[\phi] = 0 \notag\\
P_{XY}(4,-2) &= P[\{3\} \cap \{3\}] = P[\{3\}] = 1/6. \notag
\end{align}

(b) For computing the joint CDFs, it is helpful to graph these points and their associated probabilities. These probabilities are shown in parentheses. From the graph we see that

$$F_{XY}(1,1) = P_{XY}(0,0) + P_{XY}(0,-1) + P_{XY}(0,-2) = 2/3.$$

Likewise $F_{XY}(3,-0.5) = P_{XY}(0,-1) + P_{XY}(2,-1) + P_{XY}(0,-2) + P_{XY}(2,-2) = 1/6$ and $F_{XY}(5,-1.5) = P_{XY}(0,-2) + P_{XY}(2,-2) + P_{XY}(4,-2) = 1/6$.

(c) To check for dependence, it is sufficient to find one point where the PMF (or CDF) does not factor. Consider then $P_{XY}(2,0) = 1/6$, but $P_X(2)P_Y(0) = 1/6 \times 4/6 = 1/9$, so the random variables $X$ and $Y$ are not independent.
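
The bookkeeping in Example 2.6-9 can be reproduced by enumerating the six die faces. This is a small sketch using exact fractions; the helper names `marginal` and `joint_cdf` are introduced here for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def X(z):
    """X = 1 + z for faces 1 and 3, else 0."""
    return 1 + z if z in (1, 3) else 0

def Y(z):
    """Y = 1 - z for faces 1, 2 and 3, else 0."""
    return 1 - z if z in (1, 2, 3) else 0

# Joint PMF: each face of the fair die contributes probability 1/6.
joint = defaultdict(Fraction)
for face in range(1, 7):
    joint[(X(face), Y(face))] += Fraction(1, 6)

def marginal(joint, axis):
    """Sum the joint PMF over the other variable (Equations 2.6-49 and 2.6-50)."""
    out = defaultdict(Fraction)
    for key, p in joint.items():
        out[key[axis]] += p
    return out

def joint_cdf(joint, x, y):
    """F_XY(x, y) as the sum of all point masses with i <= x and j <= y."""
    return sum((p for (i, j), p in joint.items() if i <= x and j <= y), Fraction(0))

px, py = marginal(joint, 0), marginal(joint, 1)
independent = all(joint.get((i, j), Fraction(0)) == px[i] * py[j] for i in px for j in py)
```

The factorization test fails at the point (2, 0), among others, so `independent` evaluates to `False`, in agreement with part (c).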





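Probabilities associated with Example 2.6-9. The numbers in parentheses are the probabilities of reaching those points. For example, $P_{XY}(0,0) = 1/2$.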


Example 2.6-10
(joint pdf of independent Gaussians)

\begin{align}
f_{XY}(x,y) &= \frac{1}{2\pi\sigma^2}\, e^{-(1/2\sigma^2)(x^2+y^2)} \notag\\
&= \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}(x^2/\sigma^2)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}(y^2/\sigma^2)}. \tag{2.6-62}
\end{align}

Hence $X$ and $Y$ are independent RVs.


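Example 2.6-11
(calculations with independent Gaussians) The joint pdf of two random variables is given by $f_{XY}(x,y) = [2\pi]^{-1} \exp[-\frac{1}{2}(x^2+y^2)]$ for $-\infty < x, y < \infty$. Compute the probability that both $X$ and $Y$ are restricted to (a) the $2 \times 2$ square centered at the origin; and (b) the unit circle.

Solution (a) Let $\mathcal{R}_1$ denote the surface of the square. Then

\begin{align}
P[\zeta : (X,Y) \in \mathcal{R}_1] &= \iint_{\mathcal{R}_1} f_{XY}(x,y)\, dx\, dy \tag{2.6-63}\\
&= \frac{1}{\sqrt{2\pi}} \int_{-1}^{1} \exp\left[-\tfrac{1}{2}x^2\right] dx \times \frac{1}{\sqrt{2\pi}} \int_{-1}^{1} \exp\left[-\tfrac{1}{2}y^2\right] dy \tag{2.6-64}\\
&= 2\,\mathrm{erf}(1) \times 2\,\mathrm{erf}(1) \approx 0.466. \tag{2.6-65}
\end{align}

(b) Let $\mathcal{R}_2$ denote the surface of the unit circle. Then

\begin{align}
P[\zeta : (X,Y) \in \mathcal{R}_2] &= \iint_{\mathcal{R}_2} f_{XY}(x,y)\, dx\, dy \tag{2.6-66}\\
&= \iint_{\mathcal{R}_2} (2\pi)^{-1} \exp\left[-\tfrac{1}{2}(x^2+y^2)\right] dx\, dy. \tag{2.6-67}
\end{align}

With the substitution $r \triangleq \sqrt{x^2+y^2}$ and $\tan\theta \triangleq y/x$, the infinitesimal area $dx\, dy \to r\, dr\, d\theta$, and we obtain

\begin{align}
P[\zeta : (X,Y) \in \mathcal{R}_2] &= \frac{1}{2\pi} \iint_{\mathcal{R}_2} \exp\left(-\tfrac{1}{2}r^2\right) r\, dr\, d\theta \tag{2.6-68}\\
&= \frac{1}{2\pi} \int_0^{2\pi} \left[ \int_0^1 r \exp\left(-\tfrac{1}{2}r^2\right) dr \right] d\theta \tag{2.6-69}\\
&= \int_0^1 r \exp\left(-\tfrac{1}{2}r^2\right) dr \tag{2.6-70}\\
&= \int_0^{1/2} e^{-z}\, dz, \quad \text{with } z \triangleq \tfrac{1}{2}r^2, \tag{2.6-71}\\
&= 0.393. \tag{2.6-72}
\end{align}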
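
Both answers of Example 2.6-11 are easy to reproduce with the standard library. Note one assumption: the sketch treats the text's erf as the integral of the standard normal pdf from 0 to $x$, which is not the same normalization as Python's `math.erf`; the code therefore builds the standard normal CDF explicitly. The function names are chosen here for illustration.

```python
import math

def std_normal_cdf(v):
    """Standard normal CDF written in terms of the library error function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def std_normal_interval(a, b):
    """P[a < X <= b] for a zero-mean, unit-variance Gaussian."""
    return std_normal_cdf(b) - std_normal_cdf(a)

# (a) Square |x| <= 1, |y| <= 1: independence lets the probability factor.
p_square = std_normal_interval(-1.0, 1.0) ** 2

# (b) Unit circle: closed form of the radial integral in Equation 2.6-71.
p_circle = 1.0 - math.exp(-0.5)

def radial_integral(n=2000):
    """Midpoint-rule check of integral_0^1 r exp(-r^2 / 2) dr (Equation 2.6-70)."""
    h = 1.0 / n
    return sum((i + 0.5) * h * math.exp(-0.5 * ((i + 0.5) * h) ** 2) for i in range(n)) * h
```

The circle of radius 1 sits inside the square, so `p_circle` must come out smaller than `p_square`, which it does.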





Joint densities involving nonindependent RVs. Lest the reader think that all joint CDF's or pdf's factor, we next consider a case involving nonindependent random variables.


Example 2.6-12
(computing joint CDF) Consider the simple but nonfactorable joint pdf

\begin{align}
f_{XY}(x,y) &= A(x+y), \quad 0 < x \le 1,\ 0 < y \le 1, \tag{2.6-73}\\
&= 0, \quad \text{otherwise}, \tag{2.6-74}
\end{align}

and answer the following questions.

(i) What is $A$? We know that

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x,y)\, dx\, dy = 1.$$

Hence

$$A \int_0^1 dy \int_0^1 x\, dx + A \int_0^1 dx \int_0^1 y\, dy = 1 \Rightarrow A = 1.$$

(ii) What are the marginal pdf's?

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y)\, dy = \int_0^1 (x+y)\, dy = \left. (xy + y^2/2) \right|_0^1 = \begin{cases} x + \frac{1}{2}, & 0 < x \le 1, \\ 0, & \text{otherwise.} \end{cases}$$

Similarly,

$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x,y)\, dx = \begin{cases} y + \frac{1}{2}, & 0 < y \le 1, \\ 0, & \text{otherwise.} \end{cases}$$

(iii) What is $F_{XY}(x,y)$? $F_{XY}(x,y) \triangleq P[X \le x, Y \le y]$, so we must integrate over the infinite rectangle with vertices $(x,y)$, $(x,-\infty)$, $(-\infty,-\infty)$, and $(-\infty,y)$. However, only where this rectangle actually overlaps with the region over which $f_{XY}(x,y) \ne 0$, that is, the support of the pdf, written supp($f$), will there be a contribution to the integral

$$F_{XY}(x,y) = \int_{-\infty}^{x} dx' \int_{-\infty}^{y} dy'\, f_{XY}(x',y').$$

(a) $x \ge 1$, $y \ge 1$ [Figure 2.6-9(a)]

$$F_{XY}(x,y) = \int_0^1 \int_0^1 f_{XY}(x',y')\, dx'\, dy' = 1.$$

(b) $0 < x \le 1$, $y \ge 1$ [Figure 2.6-9(b)]

\begin{align}
F_{XY}(x,y) &= \int_{y'=0}^{1} \left( \int_{x'=0}^{x} (x'+y')\, dx' \right) dy' \notag\\
&= \int_{y'=0}^{1} dy' \left( \int_{x'=0}^{x} x'\, dx' \right) + \int_{y'=0}^{1} dy'\, y' \left( \int_{x'=0}^{x} dx' \right) \notag\\
&= \frac{x}{2}(x+1). \notag
\end{align}

(c) $0 < y \le 1$, $x \ge 1$ [Figure 2.6-9(c)]

$$F_{XY}(x,y) = \int_{y'=0}^{y} \int_{x'=0}^{1} (x'+y')\, dx'\, dy' = \frac{y}{2}(y+1).$$

(d) $0 < x \le 1$, $0 < y \le 1$ [Figure 2.6-9(d)]

$$F_{XY}(x,y) = \int_{y'=0}^{y} \int_{x'=0}^{x} (x'+y')\, dx'\, dy' = \frac{yx}{2}(x+y).$$

(e) $x < 0$, for any $y$; or $y < 0$, for any $x$ [Figure 2.6-9(e)]

$$F_{XY}(x,y) = 0.$$

(iv) Compute $P[X + Y \le 1]$. The point set is the half-space separated by the line $x + y = 1$ or $y = 1 - x$. However, only where this half-space intersects the region over which $f_{XY}(x,y) \ne 0$ will there be a contribution to the integral

$$P[X + Y \le 1] = \iint_{x'+y' \le 1} f_{XY}(x',y')\, dx'\, dy'.$$

[See Figure 2.6-9(f).] Hence

\begin{align}
P[X + Y \le 1] &= \int_{x'=0}^{1} \int_{y'=0}^{1-x'} (x'+y')\, dy'\, dx' \notag\\
&= \int_{x'=0}^{1} x'(1-x')\, dx' + \int_{x'=0}^{1} \frac{(1-x')^2}{2}\, dx' \notag\\
&= \frac{1}{3}. \notag
\end{align}

Figure 2.6-9 Shaded region in (a) to (e) is the intersection of supp($f_{XY}$) with the point set associated with the event $\{-\infty < X \le x, -\infty < Y \le y\}$. In (f), the shaded region is the intersection of supp($f_{XY}$) with $\{X + Y \le 1\}$.
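
The results of Example 2.6-12 can be cross-checked by brute-force integration of the pdf $x + y$ on the unit square. In this sketch the five CDF cases are folded into one expression by clamping the arguments to 1, and `integrate` is a plain midpoint rule; the names and the grid size are choices made here, not part of the text.

```python
def f_xy(x, y):
    """Joint pdf of Example 2.6-12 with A = 1."""
    return x + y if 0.0 < x <= 1.0 and 0.0 < y <= 1.0 else 0.0

def F_xy(x, y):
    """Joint CDF: cases (a) to (e) merged by clamping x and y to at most 1."""
    if x <= 0.0 or y <= 0.0:
        return 0.0
    xc, yc = min(x, 1.0), min(y, 1.0)
    return 0.5 * xc * yc * (xc + yc)

def integrate(indicator, n=400):
    """Midpoint-rule integral of f_xy over the part of the unit square where indicator holds."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            if indicator(x, y):
                total += f_xy(x, y) * h * h
    return total

p_sum = integrate(lambda x, y: x + y <= 1.0)          # estimate of P[X + Y <= 1]
p_box = integrate(lambda x, y: x <= 0.5 and y <= 0.5) # estimate of F_XY(0.5, 0.5)
```

The triangular region has a boundary that cuts through grid cells, so `p_sum` matches 1/3 only to within roughly the cell width, while `p_box` agrees closely with the closed-form CDF.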





In the previous example we dealt with a pdf that was not factorable. Another example of a joint pdf that is not factorable is

$$f_{XY}(x,y) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left( \frac{-1}{2\sigma^2(1-\rho^2)}\, (x^2 + y^2 - 2\rho xy) \right) \tag{2.6-75}$$

when $\rho \ne 0$. In this case, $X$ and $Y$ are not independent.

In the special case when $\rho = 0$ in Equation 2.6-75, $f_{XY}(x,y)$ factors as $f_X(x) f_Y(y)$ and $X$ and $Y$ become independent random variables. A picture of $f_{XY}(x,y)$ under these circumstances is shown in Figure 2.6-10 for $\sigma = 1$.

Figure 2.6-10 Graph of the joint Gaussian density $f_{XY}(x,y) = (2\pi)^{-1} \exp\left[-\frac{1}{2}(x^2+y^2)\right]$.



As we shall see in Chapter 4, Equation 2.6-75 is a special case of the jointly Gaussian 
probability density of two RVs. We defer a fuller discussion of this important pdf until 
we discuss the meaning of the parameter p. This we do in Chapter 4. We shall see in 
Chapter 5 that Equation 2.6-75 and its generalization can be written compactly in matrix 
form. 


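Example 2.6-13
(calculation with dependent Gaussian RVs) Consider again the problem considered in Example 2.6-11, part (b), except let

$$f_{XY}(x,y) = \left[2\pi\sqrt{1-\rho^2}\right]^{-1} \exp\left( -\frac{1}{2(1-\rho^2)}\, (x^2 + y^2 - 2\rho xy) \right).$$

As before let $\mathcal{R}_2$ denote the surface of the unit circle. Then

\begin{align}
P[\zeta : (X,Y) \in \mathcal{R}_2] &= \iint_{\mathcal{R}_2} f_{XY}(x,y)\, dx\, dy \tag{2.6-76}\\
&= \iint_{\mathcal{R}_2} \left[2\pi\sqrt{1-\rho^2}\right]^{-1} \exp\left( -\frac{1}{2(1-\rho^2)}\, (x^2 + y^2 - 2\rho xy) \right) dx\, dy. \tag{2.6-77}
\end{align}

With the polar coordinate substitution $r \triangleq \sqrt{x^2+y^2}$, $\tan\theta \triangleq y/x$, we obtain

\begin{align}
P[\zeta : (X,Y) \in \mathcal{R}_2] &= \frac{1}{2\pi\sqrt{1-\rho^2}} \int_0^{2\pi} \int_0^1 r \exp\left( -\frac{1}{2(1-\rho^2)}\, (r^2 - 2\rho r^2 \cos\theta \sin\theta) \right) dr\, d\theta \notag\\
&= \frac{K}{\pi} \int_0^{2\pi} \int_0^1 r \exp\left( -r^2 \left[2K^2(1 - \rho\sin 2\theta)\right] \right) dr\, d\theta, \quad \text{with } K \triangleq \frac{1}{2\sqrt{1-\rho^2}} \text{ and } \sin 2\theta = 2\sin\theta\cos\theta, \notag\\
&= \frac{K}{2\pi} \int_0^{2\pi} \left[ \int_0^1 \exp\left( -\left[2K^2(1 - \rho\sin 2\theta)\right] z \right) dz \right] d\theta, \quad \text{with } z \triangleq r^2, \notag\\
&= \frac{K}{2\pi} \int_0^{2\pi} \frac{1}{2K^2(1 - \rho\sin 2\theta)} \left( 1 - \exp\left( -\left[2K^2(1 - \rho\sin 2\theta)\right] \right) \right) d\theta \notag\\
&= \frac{1}{4\pi K} \int_0^{2\pi} \frac{1 - \exp\left[-2K^2(1 - \rho\sin 2\theta)\right]}{1 - \rho\sin 2\theta}\, d\theta. \notag
\end{align}

For $\rho = 0$, we get the probability 0.393, that is, the same as in Example 2.6-11. However, when $\rho \ne 0$, this probability must be computed numerically since this integral is not available in closed form. A MATLAB .m file that enables the computation of $P[\zeta : (X,Y) \in \mathcal{R}_2]$ is furnished below. The result is shown in Figure 2.6-11.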


MATLAB .m file for computing $P[\zeta : (X,Y) \in \mathcal{R}_2]$

    function [Pr] = corrprob
    % Probability that (X,Y) lies in the unit circle for rho = 0, 0.01, ..., 0.99.
    p = (0:99)/100;            % correlation coefficients (rho = 1 is excluded: K is infinite there)
    q = (0:99)*2*pi/100;       % integration angles theta, one full period without repeating 2*pi
    Pr = zeros(1,100);
    K = .5./sqrt(1-p.^2);
    for i = 1:100
        % integrand of the final theta-integral in Example 2.6-13
        f = (1-exp(-2*K(i)^2*(1-p(i)*sin(2*q))))./(1-p(i)*sin(2*q));
        Pr(i) = sum(f)/(4*pi)/K(i)*(2*pi/100);   % Riemann sum with step 2*pi/100
    end
    plot(p,Pr)
    title('Probability that two correlated Gaussian RVs take values in the unit circle')
    xlabel('Correlation coefficient rho')
    ylabel('Probability that (X,Y) lie in the unit circle')

Figure 2.6-11 Result of MATLAB computation in Example 2.6-13: probability that $(X, Y)$ lie in the unit circle (vertical axis, roughly 0.38 to 0.56) versus the correlation coefficient $\rho$ (horizontal axis, 0 to 1).


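In Section 4.3 of Chapter 4 we demonstrate the fact that as $\rho \to 1$,

$$f_{XY}(x,y) \to \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\, \delta(y - x).$$

Hence

\begin{align}
P[\zeta : (X,Y) \in \mathcal{R}_2] &= \iint_{\mathcal{R}_2} \frac{1}{\sqrt{2\pi}}\, e^{-0.5x^2}\, \delta(x - y)\, dx\, dy \tag{2.6-78}\\
&= \int_{-0.707}^{0.707} \frac{1}{\sqrt{2\pi}}\, e^{-0.5x^2}\, dx \tag{2.6-79}\\
&= 0.52. \tag{2.6-80}
\end{align}

This is the result that we observe in Figure 2.6-11.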
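
For readers without MATLAB, the same single integral over $\theta$ can be evaluated in a few lines of Python. This is a sketch rather than a port of the listing: it uses a midpoint rule with an arbitrarily chosen number of points `n`, and it requires $|\rho| < 1$ because $K$ is infinite at $\rho = 1$.

```python
import math

def prob_unit_circle(rho, n=20000):
    """Midpoint-rule value of (1 / (4 pi K)) * integral over theta in Example 2.6-13."""
    k = 0.5 / math.sqrt(1.0 - rho * rho)      # K = 1 / (2 sqrt(1 - rho^2))
    h = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        s = 1.0 - rho * math.sin(2.0 * (i + 0.5) * h)
        total += (1.0 - math.exp(-2.0 * k * k * s)) / s
    return total * h / (4.0 * math.pi * k)

p_uncorrelated = prob_unit_circle(0.0)    # should reduce to 1 - exp(-1/2)
p_strong = prob_unit_circle(0.99)         # should approach the limiting value near 0.52
```

At $\rho = 0$ the integrand is constant and the result reduces to the closed form of Example 2.6-11, while values of $\rho$ close to 1 approach the limit computed in Equations 2.6-78 to 2.6-80.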





Conditional densities. We shall now derive a useful formula for conditional densities involving two RVs. The formula is based on the definition of conditional probability given in Equation 1.6-2. From Equation 2.6-39 we obtain

\begin{align}
& P[x < X \le x + \Delta x,\ y < Y \le y + \Delta y] \notag\\
&\quad = F_{XY}(x + \Delta x, y + \Delta y) - F_{XY}(x, y + \Delta y) - F_{XY}(x + \Delta x, y) + F_{XY}(x,y). \tag{2.6-81}
\end{align}

Now dividing both sides by $\Delta x\, \Delta y$, taking limits, and subsequently recognizing that the right-hand side, by definition, is the second partial derivative of $F_{XY}$ with respect to $x$ and $y$ enables us to write that

$$\lim_{\substack{\Delta x \to 0 \\ \Delta y \to 0}} \frac{P[x < X \le x + \Delta x,\ y < Y \le y + \Delta y]}{\Delta x\, \Delta y} = \frac{\partial^2 F_{XY}}{\partial x\, \partial y} \triangleq f_{XY}(x,y).$$

Hence for $\Delta x$, $\Delta y$ small

$$P[x < X \le x + \Delta x,\ y < Y \le y + \Delta y] \simeq f_{XY}(x,y)\, \Delta x\, \Delta y, \tag{2.6-82}$$



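which is the two-dimensional equivalent of Equation 2.4-6. Now consider

\begin{align}
P[y < Y \le y + \Delta y \mid x < X \le x + \Delta x] &= \frac{P[x < X \le x + \Delta x,\ y < Y \le y + \Delta y]}{P[x < X \le x + \Delta x]} \notag\\
&\simeq \frac{f_{XY}(x,y)\, \Delta x\, \Delta y}{f_X(x)\, \Delta x}. \tag{2.6-83}
\end{align}

But the quantity on the left is merely

$$F_{Y|B}(y + \Delta y \mid x < X \le x + \Delta x) - F_{Y|B}(y \mid x < X \le x + \Delta x),$$

where $B \triangleq \{x < X \le x + \Delta x\}$. Hence

\begin{align}
\lim_{\substack{\Delta x \to 0 \\ \Delta y \to 0}} \frac{F_{Y|B}(y + \Delta y \mid x < X \le x + \Delta x) - F_{Y|B}(y \mid x < X \le x + \Delta x)}{\Delta y} &= \frac{f_{XY}(x,y)}{f_X(x)} \notag\\
&= \frac{\partial F_{Y|X}(y \mid X = x)}{\partial y} \notag\\
&= f_{Y|X}(y|x) \tag{2.6-84}
\end{align}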

by Equation 2.6-3. The notation $f_{Y|X}(y|x)$ reminds us that it is the conditional pdf of $Y$ given the event $\{X = x\}$. We thus obtain the important formula

$$f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}, \quad f_X(x) \ne 0. \tag{2.6-85}$$

If we use Equation 2.6-85 in Equation 2.6-48 we obtain the useful formula:

$$f_Y(y) = \int_{-\infty}^{\infty} f_{Y|X}(y|x)\, f_X(x)\, dx. \tag{2.6-86}$$

Also

$$f_{X|Y}(x|y) = \frac{f_{XY}(x,y)}{f_Y(y)}, \quad f_Y(y) \ne 0. \tag{2.6-87}$$

The quantity $f_{X|Y}(x|y)$ is called the conditional pdf of $X$ given the event $\{Y = y\}$. From Equations 2.6-85 and 2.6-87 it follows that

$$f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\, f_X(x)}{f_Y(y)}. \tag{2.6-88}$$

We illustrate with an example.





Example 2.6-14
(laser coherence) Suppose we observe the light field $U(t)$ being emitted from a laser. Laser light is said to be temporally coherent, which means that the light at any two times $t_1$ and $t_2$ is statistically dependent if $t_2 - t_1$ is not too large [2-5]. Let $X \triangleq U(t_1)$, $Y \triangleq U(t_2)$ and $t_2 > t_1$. Suppose $X$ and $Y$ are modeled as jointly Gaussian† as in Equation 2.6-75 with $\sigma^2 = 1$. For $\rho \ne 0$, they are dependent. It turns out that, using the defining Equations 2.6-47 and 2.6-48, one can show that the marginal densities $f_X(x)$ and $f_Y(y)$ are individually Gaussian. We defer the proof of this to Chapter 4. Since the means are both zero here and the variances are both one, we get for the marginal densities

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \quad \text{and} \quad f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2};$$

both are centered about zero. Now suppose that we measure the light at $t_1$, that is, $X$, and find that $X = x$. Is the pdf of $Y$, conditioned upon this new knowledge, still centered at zero, that is, is the average‡ value of $Y$ still zero?

Solution We wish to compute $f_{Y|X}(y|x)$. Applying Equation 2.6-85,

$$f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)} = \frac{1}{\sqrt{2\pi(1-\rho^2)}} \exp\left\{ -\frac{x^2 + y^2 - 2\rho xy}{2(1-\rho^2)} + \frac{1}{2}x^2 \right\}.$$

If we multiply and divide the isolated $\frac{1}{2}x^2$ term in the far right of the exponent by $1 - \rho^2$, we simplify the above as

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi(1-\rho^2)}} \exp\left\{ -\left[ \frac{x^2 + y^2 - 2\rho xy - x^2(1-\rho^2)}{2(1-\rho^2)} \right] \right\}.$$

Further simplifications result when quadratic terms in the exponent are combined into a perfect square:

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi(1-\rho^2)}} \exp\left( -\frac{(y - \rho x)^2}{2(1-\rho^2)} \right).$$

Thus, when $X = x$, the pdf of $Y$ is centered at $y = \rho x$ and not zero as previously. If $\rho x > 0$, $Y$ is more likely to take on positive values, and if $\rho x < 0$, $Y$ is more likely to take on negative values. This is in contrast to what happens when $X$ is not observed: The most likely value of $Y$ is then zero!

†Light is often modeled by a Poisson distribution due to its photon nature. As seen in Chapter 1, for a large photon count, the Gaussian distribution well approximates the Poisson distribution. Of course light intensity cannot be negative, but if the mean is large compared to the standard deviation ($\mu \gg \sigma$), then the Gaussian density will be very small there.

‡A concept to be fully developed in Chapter 4.
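
The perfect-square form of the conditional pdf can be confirmed numerically by dividing the joint density of Equation 2.6-75 (with $\sigma = 1$) by the marginal, as Equation 2.6-85 prescribes, and comparing with a Gaussian of mean $\rho x$ and variance $1 - \rho^2$. The values $x = 0.8$ and $\rho = 0.6$ below are arbitrary test points chosen for this sketch.

```python
import math

def joint_pdf(x, y, rho):
    """Equation 2.6-75 with sigma = 1."""
    c = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - rho**2))
    return c * math.exp(-(x * x + y * y - 2.0 * rho * x * y) / (2.0 * (1.0 - rho**2)))

def marginal_pdf(x):
    """Zero-mean, unit-variance Gaussian marginal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def conditional_pdf(y, x, rho):
    """f_{Y|X}(y|x) as the ratio in Equation 2.6-85."""
    return joint_pdf(x, y, rho) / marginal_pdf(x)

def gaussian_pdf(y, mean, var):
    """General Gaussian pdf used for the comparison."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

x_obs, rho = 0.8, 0.6   # assumed illustrative values
gaps = [abs(conditional_pdf(y, x_obs, rho) - gaussian_pdf(y, rho * x_obs, 1.0 - rho**2))
        for y in (-2.0, -0.5, 0.0, 0.48, 1.0, 3.0)]
```

All entries of `gaps` are at rounding-error level, and the conditional density peaks at $y = \rho x = 0.48$ rather than at zero.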





A major application of conditioned events and conditional probabilities occurs in the 
science of estimating failure rates. This is discussed in the next section. 


2.7 FAILURE RATES 


In modern industrialized society where planning for equipment replacement, issuance of life 
insurance, and so on are important activities, there is a need to keep careful records of 
the failure rates of objects, be they machines or humans. For example consider the cost of 
life insurance: Clearly it wouldn’t make much economic sense to price a five-year term-life 
insurance policy for a 25-year-old woman at the same level as, say, for a 75-year-old man. 
The “failure” probability (i.e., death) for the older man is much higher than for the young 
woman. Hence, sound pricing policy will require the insurance company to insure the older 
man at a higher price. How much higher? This is determined from actuarial tables which 
are estimates of life expectancy conditioned on many factors. One important condition is 
“that you have survived until (a certain age).” In other words, the probability that you 
will survive to age 86, given that you have survived to age 85, is much higher than the 
probability that you will survive to age 86 if you are an infant. 

Let $X$ denote the time of failure or, equivalently, the failure time. Then by Bayes' theorem, the probability that failure will occur in the interval $[t, t + dt]$ given that the object has survived to $t$ can be written as

$$P[t < X \le t + dt \mid X > t] = \frac{P[t < X \le t + dt,\ X > t]}{P[X > t]}. \tag{2.7-1}$$

But since the event $\{t < X \le t + dt\}$ is contained in the event $\{X > t\}$, it follows that $P[t < X \le t + dt,\ X > t] = P[t < X \le t + dt]$. Hence

$$P[t < X \le t + dt \mid X > t] = \frac{P[t < X \le t + dt]}{P[X > t]}. \tag{2.7-2}$$

By recalling that $P[t < X \le t + dt] = F_X(t + dt) - F_X(t)$, we obtain

$$P[t < X \le t + dt \mid X > t] = \frac{F_X(t + dt) - F_X(t)}{1 - F_X(t)}. \tag{2.7-3}$$

A Taylor series expansion of the CDF $F_X(t + dt)$ about the point $t$ yields, to first order (we assume that $F_X$ is differentiable),

$$F_X(t + dt) = F_X(t) + f_X(t)\, dt.$$

When this result is used in Equation 2.7-3, we obtain at last

\begin{align}
P[t < X \le t + dt \mid X > t] &= \frac{f_X(t)\, dt}{1 - F_X(t)} \tag{2.7-4}\\
&\triangleq \alpha(t)\, dt, \notag
\end{align}

where

$$\alpha(t) \triangleq \frac{f_X(t)}{1 - F_X(t)}. \tag{2.7-5}$$

The object $\alpha(t)$ is called the conditional failure rate although it has other names such as the hazard rate, force of mortality, intensity rate, instantaneous failure rate, or simply failure rate. If the conditional failure rate at $t$ is large, then an object surviving to time $t$ will have a higher probability of failure in the next $\Delta t$ seconds than another object with lower conditional failure rate. Many objects, including humans, have failure rates that vary
with time. During the early life of the object, failure rates may be high due to inherent or 
congenital defects. After this early period, the object enjoys a useful life characterized by 
a near-constant failure rate. Finally, as the object ages and parts wear out, the failure rate 
increases sharply leading quickly and inexorably to failure or death. 

The pdf of the random variable X can be computed explicitly from Equation 2.7-3
when we observe that $F_X(t+dt) - F_X(t) = F_X'(t)\,dt = dF_X$. Thus, we get

$$ \frac{dF_X}{1 - F_X} = \alpha(t)\,dt, \tag{2.7-6} $$

which can be solved by integration. First recall from calculus that

$$ \int_{y_0}^{y_1} \frac{dy}{1-y} = -\int_{1-y_0}^{1-y_1} \frac{dy}{y} = \int_{1-y_1}^{1-y_0} \frac{dy}{y} = \ln\frac{1-y_0}{1-y_1}. $$

Second, use the facts that

(i) $F_X(0) = 0$ since we assume that the object is working at t = 0 (the time that the
object is turned on);
(ii) $F_X(\infty) = 1$ since we assume that the object must ultimately fail.

Then

$$ \int_{F_X(0)}^{F_X(t)} \frac{dF_X}{1 - F_X} = -\ln[1 - F_X(t)] = \int_0^t \alpha(t')\,dt', $$

from which we finally obtain

$$ F_X(t) = 1 - \exp\left[-\int_0^t \alpha(t')\,dt'\right]. \tag{2.7-7} $$

Since $F_X(\infty) = 1$ we must have

$$ \int_0^\infty \alpha(t')\,dt' = \infty. \tag{2.7-8} $$

Equation 2.7-7 is the CDF for the failure time X. By differentiating Equation 2.7-7, we
obtain the pdf

$$ f_X(t) = \alpha(t)\exp\left[-\int_0^t \alpha(t')\,dt'\right]. \tag{2.7-9} $$

Different pdf's result from different models for the conditional failure rate $\alpha(t)$.
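As a numerical illustration of Equations 2.7-7 and 2.7-9, the following Python sketch builds the CDF and pdf of the failure time from a given conditional failure rate. The function names, the trapezoidal integration, and the two sample rate models (a constant rate of 0.5 and a linearly increasing rate 0.06t) are assumptions chosen for illustration, not part of the text.

```python
import math

def cdf_from_hazard(alpha, t, steps=10_000):
    """F_X(t) = 1 - exp(-integral_0^t alpha(s) ds), using the trapezoidal rule."""
    if t <= 0:
        return 0.0
    h = t / steps
    integral = 0.5 * (alpha(0.0) + alpha(t))
    for k in range(1, steps):
        integral += alpha(k * h)
    integral *= h
    return 1.0 - math.exp(-integral)

def pdf_from_hazard(alpha, t, steps=10_000):
    """f_X(t) = alpha(t) * [1 - F_X(t)], which is Equation 2.7-9 rewritten."""
    return alpha(t) * (1.0 - cdf_from_hazard(alpha, t, steps))

LAM = 0.5  # hypothetical constant failure rate

def constant_rate(t):
    return LAM

def linear_rate(t):
    return 0.06 * t  # illustrative linearly increasing rate

F_const = cdf_from_hazard(constant_rate, 2.0)   # compare with 1 - exp(-LAM * 2)
f_const = pdf_from_hazard(constant_rate, 2.0)   # compare with LAM * exp(-LAM * 2)
F_lin = cdf_from_hazard(linear_rate, 10.0)      # compare with 1 - exp(-0.03 * 10**2)

print(F_const, 1.0 - math.exp(-LAM * 2.0))
print(f_const, LAM * math.exp(-LAM * 2.0))
print(F_lin, 1.0 - math.exp(-3.0))
```

A constant rate reproduces the exponential law, while the linearly increasing rate gives a CDF of the form 1 - exp(-0.03 t^2); the trapezoidal rule is exact for both of these integrands, so the printed pairs should agree to rounding error.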





Example 2.7-1 
(conditional failure rate for the exponential case) Assume that X obeys the exponential
probability law, that is, $F_X(t) = (1 - e^{-\lambda t})u(t)$. We find

$$ \alpha(t) = \frac{f_X(t)}{1 - F_X(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda, \quad t > 0. $$

Thus, the conditional failure rate is a constant. Conversely, if $\alpha(t)$ is a constant, the
failure time obeys the exponential probability law.


An important point to observe is that the conditional failure rate is not a pdf (see
Equation 2.7-8). The conditional density of X, given $\{X > t\}$, can be computed from the
conditional distribution by differentiation. For example,

$$ F_X(x \mid X > t) \triangleq P[X \le x \mid X > t] = \frac{P[X \le x,\ X > t]}{P[X > t]}. \tag{2.7-10} $$

The event $\{X \le x, X > t\}$ is clearly empty if $t \ge x$. If $t < x$, then
$\{X \le x, X > t\} = \{t < X \le x\}$. Thus,

$$ F_X(x \mid X > t) = \begin{cases} 0, & t \ge x, \\ \dfrac{F_X(x) - F_X(t)}{1 - F_X(t)}, & x > t. \end{cases} \tag{2.7-11} $$

Hence, differentiating with respect to x (for a continuous RV, conditioning on $\{X \ge t\}$ or
on $\{X > t\}$ gives the same result),

$$ f_X(x \mid X \ge t) = \begin{cases} 0, & t > x, \\ \dfrac{f_X(x)}{1 - F_X(t)}, & x \ge t. \end{cases} \tag{2.7-12} $$

The connection between $\alpha(t)$ and $f_X(x \mid X \ge t)$ is obtained by comparing
Equation 2.7-12 with Equation 2.7-5, that is,

$$ f_X(t \mid X \ge t) = \alpha(t). \tag{2.7-13} $$


Example 2.7-2
(Itsibitsi breakdown) Oscar, a college student, has a nine-year-old Itsibitsi, an import car
famous for its reliability. The conditional failure rate, based on field data, is
$\alpha(t) = 0.06\,t\,u(t)$, assuming a normal usage of 10,000 miles/year. To celebrate the end
of the school year, Oscar begins a 30-day cross-country motor trip. What is the probability
that Oscar's Itsibitsi will have a first breakdown during his trip?


Solution First, we compute the pdf $f_X(t)$ as

$$ f_X(t) = 0.06\,t\,\exp\left[-\int_0^t 0.06\,t'\,dt'\right]u(t) \tag{2.7-14} $$

$$ = 0.06\,t\,e^{-0.03t^2}\,u(t). \tag{2.7-15} $$

Next, we convert 30 days into 0.0824 years. Finally, we note that

$$ P[9.0 < X \le 9.0824 \mid X > 9] = \frac{P[9.0 < X \le 9.0824]}{1 - F_X(9)}, $$

where we have used Bayes' rule and the fact that the event
$\{9 < X \le 9.0824\} \cap \{X > 9\} = \{9 < X \le 9.0824\}$.
Since

$$ P[9.0 < X \le 9.0824] = 0.06\int_{9.0}^{9.0824} t\,e^{-0.03t^2}\,dt \tag{2.7-16} $$

$$ = \frac{0.06}{2}\int_{(9.0)^2}^{(9.0824)^2} e^{-0.03z}\,dz \quad \text{with } z \triangleq t^2, \tag{2.7-17} $$

$$ = \frac{0.06}{2}\cdot\frac{1}{0.03}\left(e^{-0.03(9.0)^2} - e^{-0.03(9.0824)^2}\right) \tag{2.7-18} $$

$$ = e^{-0.03(9.0)^2} - e^{-0.03(9.0824)^2} \tag{2.7-19} $$

$$ \approx 0.0038 \tag{2.7-20} $$

and

$$ 1 - F_X(9) = e^{-0.03(9.0)^2} \approx 0.088, $$

Oscar's car has a $3.8 \times 10^{-3}/8.8 \times 10^{-2}$, or 0.043, probability of suffering a
breakdown in the next 30 days.
Incidentally, the probability that a newly purchased Itsibitsi will have at least one
breakdown in ten years is 0.95.
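The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch: the helper name `survival` and its parameter `c` are illustrative assumptions, and the closed form exp(-0.03 t^2) for 1 - F_X(t) follows from Equation 2.7-7 with the failure rate 0.06t.

```python
import math

def survival(t, c=0.03):
    """1 - F_X(t) = exp(-c * t**2) for a failure rate 2*c*t, t >= 0."""
    return math.exp(-c * t * t) if t > 0 else 1.0

t0 = 9.0         # current age of the car, in years
trip = 0.0824    # trip length in years, as used in the text

p_interval = survival(t0) - survival(t0 + trip)   # P[9 < X <= 9.0824]
p_conditional = p_interval / survival(t0)         # conditioned on survival to 9 years
p_ten_years = 1.0 - survival(10.0)                # breakdown within ten years of purchase

print(p_interval, survival(t0), p_conditional, p_ten_years)
```

Carrying full precision gives a conditional probability near 0.0437, which agrees with the text's 0.043 to within the rounding of the intermediate values 0.0038 and 0.088.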





SUMMARY 


The material discussed in this chapter is central to the concept of the whole book. We began
by defining a real random variable as a mapping from the sample space $\Omega$ to the real line
R. We then introduced a point function $F_X(x)$ called the cumulative distribution function
(CDF), which enabled us to compute the probabilities of events of the type
$\{\zeta: \zeta \in \Omega, X(\zeta) \le x\}$. The probability density function (pdf) and
probability mass function (PMF) were derived from the CDF, and a number of useful and specific
probability laws were discussed. We showed how, by using Dirac delta functions, we could
develop a unified theory for both discrete and continuous random variables. We then discussed
joint distributions, the Poisson transform and its inverse, and the application of these
concepts to physical problems.

We discussed the important concept of conditional probability and illustrated its application
in the area of conditional failure rates. The conditional failure rate, often high at the
outset, constant during mid-life, and high at old age, is fundamental in determining the
probability law of time-to-failure.


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 
2.1 The event of k successes in n tries regardless of the order is the binomial law b(k, n; p).
Let n = 10, p = 0.4. Define the RV X by

$$ X(k) = \begin{cases} 1, & \text{for } 0 \le k \le 2, \\ 2, & \text{for } 2 < k \le 5, \\ 3, & \text{for } 5 < k \le 8, \\ 4, & \text{for } 8 < k \le 10. \end{cases} $$

Compute the probabilities P[X = j] for j = 1,...,4. Plot the CDF $F_X(x) = P[X \le x]$
for all x.

*2.2 Consider the probability space $(\Omega, \mathcal{F}, P)$. Give an example, and substantiate
it in a sentence or two, where all outcomes have probability zero. Hint: Think in terms of
random variables.

2.3 In a restaurant known for its unusual service, the time X, in minutes, that a customer
has to wait before he captures the attention of a waiter is specified by the following
CDF:

$$ F_X(x) = \begin{cases} (x/2)^2, & \text{for } 0 \le x \le 1, \\ x/4, & \text{for } 1 \le x \le 2, \\ 1/2, & \text{for } 2 \le x \le 10, \\ x/20, & \text{for } 10 \le x \le 20, \\ 1, & \text{for } x \ge 20. \end{cases} $$

(a) Sketch $F_X(x)$. (b) Compute and sketch the pdf $f_X(x)$. Verify that the area under
the pdf is indeed unity. (c) What is the probability that the customer will have to
wait (1) at least 15 minutes, (2) less than 5 minutes, (3) between 5 and 15 minutes,
(4) exactly 1 minute?

2.4 Compute the probabilities of the events $\{X < a\}$, $\{X \le a\}$, $\{a \le X < b\}$,
$\{a \le X \le b\}$, $\{a < X \le b\}$, and $\{a < X < b\}$ in terms of $F_X(x)$ and
$P[X = x]$ for x = a, b.

2.5 In the following pdf's, compute the constant B required for proper normalization:
Cauchy ($\alpha < \infty$, $\beta > 0$):

$$ f_X(x) = \frac{B}{1 + [(x-\alpha)/\beta]^2}, \quad -\infty < x < \infty. $$

Maxwell ($\alpha > 0$):

$$ f_X(x) = \begin{cases} B x^2 e^{-x^2/\alpha^2}, & x > 0, \\ 0, & \text{otherwise.} \end{cases} $$

*2.6 For these more advanced pdf's, compute the constant B required for proper normalization:
Beta (b > −1, c > −1):

$$ f_X(x) = \begin{cases} B x^b (1-x)^c, & 0 \le x \le 1, \\ 0, & \text{otherwise.} \end{cases} $$

(See formula 6.2-1 on page 258 of [2-6].)
Chi-square ($\sigma > 0$, n = 1, 2, ...):

$$ f_X(x) = \begin{cases} B x^{(n/2)-1} \exp(-x/2\sigma^2), & x > 0, \\ 0, & \text{otherwise.} \end{cases} $$

2.7 Let X be a continuous random variable with pdf

$$ f_X(x) = \begin{cases} kx, & 0 \le x < 1, \\ 0, & \text{otherwise,} \end{cases} $$

where k is a constant.

(a) Determine the value of k and sketch $f_X(x)$.
(b) Find and sketch the corresponding CDF $F_X(x)$.
(c) Find $P(1/4 < X \le 2)$.
2.8 Compute $F_X(k\sigma)$ for the Rayleigh pdf (Equation 2.4-15) for k = 0, 1, 2, ....

2.9 Write the probability density functions (using delta functions) for the Bernoulli,
binomial, and Poisson PMF's.

2.10 The pdf of a RV X is shown in Figure P2.10. The numbers in parentheses indicate
area. (a) Compute the value of A; (b) sketch the CDF; (c) compute $P[2 \le X < 3]$;
(d) compute $P[2 < X \le 3]$; (e) compute $F_X(3)$.

Figure P2.10 pdf of a Mixed RV. (Horizontal axis x, marked from 0 to 4.)

2.11 The CDF of a random variable X is given by $F_X(x) = (1 - e^{-x})u(x)$. Find the
probability of the event $\{\zeta: X(\zeta) < 1 \text{ or } X(\zeta) > 2\}$.

2.12 The pdf of random variable X is shown in Figure P2.10. The numbers in parentheses
indicate area. Compute the value of A. Compute P[2 < X < 4].

2.13 (two coins tossing) The experiment consists of throwing two indistinguishable coins
simultaneously. The sample space is $\Omega$ = {two heads, one head, no heads}, which we
denote abstractly as $\Omega = \{\zeta_1, \zeta_2, \zeta_3\}$. Next, define two random
variables as

$$ X(\zeta_1) = 0, \quad X(\zeta_2) = 0, \quad X(\zeta_3) = 1, $$
$$ Y(\zeta_1) = 1, \quad Y(\zeta_2) = -1, \quad Y(\zeta_3) = 1. $$

(a) Compute all possible joint probabilities of the form
$P[\zeta: X(\zeta) = \alpha, Y(\zeta) = \beta]$, $\alpha \in \{0, 1\}$, $\beta \in \{-1, 1\}$.
(b) Determine whether X and Y are independent random variables.

2.14 The pdf of the random variable X is shown in Figure P2.14. The numbers in parentheses
indicate the corresponding impulse area. So, $f_X(x)$ consists of impulses at x = −2,
x = −1, and x = +1, with the areas shown in the figure, plus the continuous component

$$ \begin{cases} A x^2, & |x| \le 2, \\ 0, & \text{else.} \end{cases} $$

Note that the density $f_X$ is zero off of [−2, +2].

Figure P2.14 pdf of the mixed RV in Problem 2.14. (Horizontal axis marked from −2 to 2;
the one impulse-area label that survives is (1/16).)

(a) Determine the value of the constant A.
(b) Plot the CDF $F_X(x)$. Please label the significant points on your plot.
(c) Calculate $F_X(1)$.
(d) Find $P[-1 < X \le 0]$.


2.15 Consider a binomial RV with PMF b(k; 4, 1/2). Compute P[X = k | X odd] for k =
0,...,4.

*2.16 Continuing with Example 2.6-8, find the marginal distribution function $F_N(n)$. Find
and sketch the corresponding PMF $P_N(n)$. Also find the conditional probability
density function $f_W(w \mid N = n) = f_{W|N}(w|n)$. (In words, $f_{W|N}(w|n)$ is the pdf of W
given that N = n.)

2.17 The time-to-failure in months, X, of light bulbs produced at two manufacturing
plants A and B obeys, respectively, the following CDFs:

$$ F_X(x) = (1 - e^{-x/5})u(x) \quad \text{for plant A}, \tag{2.7-21} $$
$$ F_X(x) = (1 - e^{-x/2})u(x) \quad \text{for plant B}. \tag{2.7-22} $$

Plant B produces two times as many bulbs as plant A. The bulbs, indistinguishable
to the eye, are intermingled and sold. What is the probability that a bulb purchased
at random will burn at least (a) two months; (b) five months; (c) seven months?

2.18 A smooth-surface table is ruled with equidistant parallel lines, a distance D apart. A
needle of length L, where $L \le D$, is dropped onto the table. What is the probability
that the needle will intersect one of the lines?

2.19 It has been found that the number of people Y waiting in a queue in the bank on
payday obeys the Poisson law as

$$ P[Y = k \mid X = x] = e^{-x}\frac{x^k}{k!}, \quad k \ge 0,\ x > 0, $$

given that the normalized serving time of the teller x (i.e., the time it takes the teller
to deal with a customer) is constant. However, the serving time is more accurately
modeled as an RV X. For simplicity let X be a uniform RV with

$$ f_X(x) = \tfrac{1}{5}[u(x) - u(x - 5)]. $$

Then P[Y = k | X = x] is still Poisson but P[Y = k] is something else. Compute
P[Y = k] for k = 0, 1, and 2. The answer for general k may be difficult.
2.20 Suppose in a presidential election each vote has equal probability p = 0.5 of being
in favor of either of two candidates, candidate 1 and candidate 2. Assume all votes
are independent. Suppose 8 votes are selected for inspection. Let X be the random
variable that represents the number of favorable votes for candidate 1 in these 8
votes. Let A be the event that this number of favorable votes exceeds 4, that is,
A = {X > 4}.
(a) What is the PMF for the random variable X? Note that the PMF should be
symmetric about X = k = 4.
(b) Find and plot the conditional distribution function $F_X(x|A)$ for the range
$-1 \le x \le 10$.
(c) Find and plot the conditional pdf $f_X(x|A)$ for the range $-1 \le x \le 10$.
(d) Find the conditional probability that the number of favorable votes for
candidate 1 is between 4 and 5 inclusive, that is, $P[4 \le X \le 5 \mid A]$.

2.21 Suppose that random variables X and Y have a joint density function

$$ f(x, y) = \begin{cases} A(2x + y), & 2 < x < 6,\ 0 < y < 5, \\ 0, & \text{otherwise.} \end{cases} $$

Find
(a) the constant A
(b) P(X > 3)
(c) $F_Y(2)$
(d) P(3 < X < 4 | Y > 2)
2.22 Consider the joint pdf of X and Y:


1 a 2 2 
fxv (0,y) = gre HEND +a u(a)u(y). 


Are X and Y independent RVs? Compute the probability of $\{0 < X \le 2, 0 < Y \le 3\}$.

2.23 The radial miss distance (in meters, m) of the landing point of a parachuting sky
diver from the centre of the target area is known to be a Rayleigh random variable
X with parameter $\sigma^2 = 100$.
(a) Find the radius r such that $P(X > r) = e^{-1}$.
(b) Find the probability of the sky diver landing within a 10 m radius from the
centre of the target area, given that the landing is within 50 m from the centre
of the target area.

2.24 Show that Equation 2.6-75 factors as $f_X(x)f_Y(y)$ when $\rho = 0$. What are $f_X(x)$ and
$f_Y(y)$? For $\sigma = 1$ and $\rho = 0$, what is
$P[-\tfrac12 < X \le \tfrac12, -\tfrac12 < Y \le \tfrac12]$?

*2.25 Consider a communication channel corrupted by noise. Let X be the value of the
transmitted signal and Y be the value of the received signal. Assume that the conditional
density of Y given {X = x} is Gaussian, that is,

$$ f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - x)^2}{2\sigma^2}\right], $$

and X is uniformly distributed on [−1, 1]. What is the conditional pdf of X given
Y, that is, $f_{X|Y}(x|y)$?

2.26 Consider a communication channel corrupted by noise. Let X be the value of the
transmitted signal and Y the value of the received signal. Assume that the conditional
density of Y given X is Gaussian, that is,

$$ f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - x)^2}{2\sigma^2}\right], $$

and that X takes on only the values +1 and −1 equally likely. What is the conditional
density of X given Y, that is, $f_{X|Y}(x|y)$?

2.27 The arrival time of a professor to his office is a continuous RV uniformly distributed
over the hour between 8 A.M. and 9 A.M. Define the events:

A = {The prof. has not arrived by 8:30 A.M.}, (2.7-23)
B = {The prof. will arrive by 8:31 A.M.}. (2.7-24)

Find
(a) P[B|A].
(b) P[A|B].
2.28 Let X be a random variable with pdf
0, z<O0, 
fx(z) = h an >, (c > 0). 


(a) Find c;
(b) Let a > 0, x > 0; find $P[X \ge x + a]$;
(c) Let a > 0, x > 0; find $P[X \ge x + a \mid X \ge a]$.

2.29 To celebrate getting a passing grade in a course on probability, Wynette invites
her Professor, Dr. Chance, to dinner at the famous French restaurant C'est Tres
Chere. The probability of getting a reservation if you call y days in advance is given
by $1 - e^{-y}$, where $y \ge 0$. What is the minimum number of days that Wynette
should call in advance in order to have a probability of at least 0.90 of getting a
reservation?

2.30 A U.S. defense radar scans the skies for unidentified flying objects (UFOs). Let M
be the event that a UFO is present and $M^c$ the event that a UFO is absent. Let
$f_{X|M}(x|M) = \frac{1}{\sqrt{2\pi}} \exp(-0.5[x - r]^2)$ be the conditional pdf of the radar
return signal X when a UFO is actually there, and let
$f_{X|M}(x|M^c) = \frac{1}{\sqrt{2\pi}} \exp(-0.5[x]^2)$ be the conditional pdf of the radar
return signal X when there is no UFO. To be specific, let r = 1 and let the alert level be
$x_A = 0.5$. Let A denote the event of an alert, that is, $\{X > x_A\}$. Compute
$P[A|M]$, $P[A^c|M]$, $P[A|M^c]$, $P[A^c|M^c]$.

2.31 In the previous problem assume that $P[M] = 10^{-4}$. Compute
$P[M|A]$, $P[M|A^c]$, $P[M^c|A]$, $P[M^c|A^c]$. Repeat for $P[M] = 10^{-6}$.

Note: By assigning drastically different numbers to P[M], this problem attempts to
illustrate the difficulty of using probability in some types of problems. Because a
UFO appearance is so rare (except in Roswell, New Mexico), it may be considered a
one-time event for which accurate knowledge of the prior probability P[M] is near
impossible. Thus, in the surprise attack by the Japanese on Pearl Harbor in 1941,
while the radar clearly indicated a massive cloud of incoming objects, the signals
were ignored by the commanding officer (CO). Possibly the CO assumed that the
prior probability of an attack was so small that a radar failure was more likely.

*2.32 (research problem: receiver-operating characteristics) In the radar problem above,
$P[A|M^c]$ is known as $\alpha$, the probability of a false alarm, while $P[A|M]$ is known
as $\beta$, the probability of a correct detection. Clearly $\alpha = \alpha(x_A)$,
$\beta = \beta(x_A)$. Write a MATLAB program to plot $\beta$ versus $\alpha$ for a fixed
value of r. Choose r = 0, 1, 2, 3. The curves so obtained are known among radar people as
the receiver-operating characteristic (ROC) for various values of r.

2.33 A sophisticated house security system uses an infrared beam to complete a circuit. If
the circuit is broken, say by a robber crossing the beam, a bell goes off. The way the
system works is as follows: The photodiode generates a beam of infrared photons at
a Poisson rate of $8 \times 10^6$ photons per second. Every microsecond a counter counts the
total number of photons collected at the detector. If the count drops below 2 photons
in the counting interval ($10^{-6}$ seconds), it is assumed that the circuit is broken and
the bell rings. Assuming the Poisson PMF, compute the probability of a false alarm
during a one-second interval.

2.34 A traffic light can be in one of three states: green (G), red (R), and yellow (Y).
The light changes in a random fashion (e.g., the light at the corner of Hoosick and
Doremus in Nirvana, New York). At any one time the light can be in only one state.
The experiment consists of observing the state of the light.

(a) Give the sample space of this experiment and list five events.
(b) Let a random variable X(·) be defined as follows: X(G) = −1; X(R) = 0;
X(Y) = 7. Assume that P[G] = P[Y] = 0.5 × P[R]. Plot the pdf of X. What
is $P[X \le 3]$?
*2.35 A token-based, multi-user communication system works as follows: say that nine
user-stations are connected to a ring and an electronic signal, called a token, is passed
around the ring in, say, a clockwise direction. The token stops at each station and
allows the user (if there is one) up to five minutes of signaling a message. The token
waits for a maximum of one minute at each station for a user to initiate a message.
If no user appears at the station at the end of the minute, the token is passed on
to the next station. The five-minute window includes the waiting time of the token
at the station. Thus, a user who begins signaling at the end of the token waiting
period has only four minutes of signaling left.

(a) Assume that you are a user at a station. What are the minimum and maximum
waiting times you might experience? The token is assumed to travel
instantaneously from station to station.

(b) Let the probability that a station is occupied be p. If a station is occupied,
the "occupation time" is a random variable that is uniformly distributed in
(0, 5) minutes. Using MATLAB, write a program that simulates the waiting
time at your station. Assume that the token has just left your station. Pick
various values of p.


2.36 All manufactured devices and machines fail to work sooner or later. Suppose that the
failure rate is constant and the time to failure (in hours) is an exponential random
variable X with parameter $\lambda$.

(a) Measurements show that the probability that the time to failure for computer
memory chips in a given class exceeds $10^4$ is $e^{-1}$. Calculate $\lambda$.

(b) Using the above value of $\lambda$, calculate the time $x_0$ such that 0.05 is the
probability that the time to failure is less than $x_0$.


2.37 We are given the following joint pdf for random variables X and Y:

$$ f_{XY}(x, y) = \begin{cases} A, & 0 \le |x| + |y| < 1, \\ 0, & \text{else.} \end{cases} $$

(a) What is the value of the constant A?

(b) What is the marginal density $f_X(x)$?

(c) Are X and Y independent? Why?

(d) What is the conditional density $f_{Y|X}(y|x)$?

Figure P2.37 Support of $f_{XY}(x, y)$ in Problem 2.37.

2.38 A laser used to scan the bar code on supermarket items is assumed to have a constant
conditional failure rate $\lambda$ (> 0). What is the maximum value of $\lambda$ that will yield
a probability of a first breakdown in 100 hours of operation less than or equal
to 0.05?

2.39 Compute the pdf of the failure time X if the conditional failure rate $\alpha(t)$ is as
shown in Figure P2.39.

Figure P2.39 Failure rate $\alpha(t)$ in Problem 2.39.

2.40 Two people agree to meet between 2:00 p.m. and 3:00 p.m., with the understanding
that each will wait no longer than 15 minutes for the other. What is the probability
that they will meet?

Functions of Random Variables


3.1 INTRODUCTION 


A classic problem in engineering is the following: We are given the input to a system and 
we must calculate the output. If the input to a system is random, the output will generally 
be random as well. To put this somewhat more formally, if the input at some instant t or
point x is a random variable (RV), the output at some corresponding instant t′ or point x′
will be a random variable. Now the question arises, if we know the CDF, PMF, or pdf of the
input RV can we compute these functions for the output RV? In many cases we can, while 
in other cases the computation is too difficult and we settle for descriptors of the output 
RV which contain less information than the CDF. Such descriptors are called averages or 
expectations and are discussed in Chapter 4. In general for systems with memory, that is, 
systems in which the output at a particular instant of time depends on past values of the 
input (possibly an infinite number of such past values), it is much more difficult (if not 
impossible) to calculate the CDF of the output. This is the case for random sequences 
and processes to be treated in Chapters 7 and 8. In this chapter, we study much simpler 
situations involving just one or a few random variables. We illustrate with some examples. 





Example 3.1-1 
(power loss in resistor) As is well known from electric circuit theory, the current I flowing
through a resistor R (Figure 3.1-1) dissipates an amount of power W given by

$$ W \triangleq W(I) = I^2 R. \tag{3.1-1} $$

Figure 3.1-1 Ohmic power dissipation in a resistor.

Equation 3.1-1 is an explicit rule that generates for every value of I a number W(I). This
rule or correspondence is called a function and is denoted by W(·) or merely W or sometimes
even W(I), although the latter notation obscures the difference between the rule
and the actual number. Clearly, if I were a random variable, the rule $W = I^2R$ generates a
new random variable W whose CDF might be quite different from that of I.† Indeed, this
alludes to the heart of the problem: Given a rule g(·), and a random variable X with pdf
$f_X(x)$, what is the pdf $f_Y(y)$ of the random variable Y = g(X)?





The computation of fy (y), Fy (y), or the PMF of Y, that is, Py (yi), can be very simple 
or quite complex. We illustrate such a computation with a second example, one that comes 
from communication theory. 





Example 3.1-2 
(waveform detector) A two-level waveform is made analog because of the effect of additive
Gaussian noise (Figure 3.1-2). A decoder samples the analog waveform x(t) at $t_0$ and decodes
according to the following rule:

Input to decoder x          Output of decoder y
If $x(t_0)$ is:             Then y is assigned:
  $\ge \tfrac12$              1
  $< \tfrac12$                0

What is the PMF or pdf of Y?

Solution Clearly with Y (an RV) denoting the output of the decoder, we can write the
following events:

$$ \{Y = 0\} = \{X < 0.5\} \tag{3.1-2a} $$
$$ \{Y = 1\} = \{X \ge 0.5\}, \tag{3.1-2b} $$

where $X \triangleq x(t_0)$. Hence if we assume X: N(1,1), we obtain the following:

$$ P[Y = 0] = P[X < 0.5] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0.5} e^{-\frac12(x-1)^2}\,dx \approx 0.31. \tag{3.1-3} $$

†This is assuming that the composite function $I^2(\zeta)R$ satisfies the required properties of
an RV (see Section 2.2).

Figure 3.1-2 Decoding of a noise-corrupted digital pulse by sampling and hard clipping.
(Panels (a) through (d).)

Figure 3.1-3 The area associated with P[Y = 0] in Example 3.1-2.

In arriving at Equation 3.1-3 we use the normalization procedure explained in Section 2.4
and the fact that for X: N(0,1) and any x < 0, the CDF $F_X(x) = \tfrac12 - \mathrm{erf}(|x|)$. The
area under the Normal N(1,1) curve associated with P[Y = 0] is shown in Figure 3.1-3.

In a similar fashion we compute P[Y = 1] = 0.69. Hence the PMF of Y is

$$ P_Y(y) = \begin{cases} 0.31, & y = 0, \\ 0.69, & y = 1, \\ 0, & \text{else.} \end{cases} \tag{3.1-4} $$

Using Dirac delta functions, we can obtain the pdf of Y:

$$ f_Y(y) = 0.31\,\delta(y) + 0.69\,\delta(y - 1). \tag{3.1-5} $$

In terms of the Kronecker delta function, that is, $\delta(y) = 1$ at y = 0 and equal to 0 else,
the PMF would be

$$ P_Y(y) = 0.31\,\delta(y) + 0.69\,\delta(y - 1). $$

The Kronecker $\delta$ is used in PMFs while the Dirac $\delta$ is used in pdfs. We keep the
symbols the same although they mean different things.
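The two probabilities in this example are easy to reproduce numerically. The sketch below is an illustration only: `normal_cdf` is a helper name introduced here, and it uses Python's `math.erf`, which is the standard error function and is normalized differently from the erf convention used in this text.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of N(mean, std**2), written with the standard error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

threshold = 0.5                                   # decoder decision level
p_y0 = normal_cdf(threshold, mean=1.0, std=1.0)   # P[Y = 0] = P[X < 0.5]
p_y1 = 1.0 - p_y0                                 # P[Y = 1] = P[X >= 0.5]

pmf_Y = {0: p_y0, 1: p_y1}
print({y: round(p, 2) for y, p in pmf_Y.items()})
```

Rounded to two places, the dictionary reproduces the PMF values 0.31 and 0.69 quoted in the example.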


Of course not all function-of-a-random-variable (FRV) problems are this easy to evaluate. 
To gain a deeper insight into the FRV problem, we take a closer look at the underlying 
concept of FRV. The gain in insight will be useful when we discuss random sequences and 
processes beginning in Chapter 7. 


Functions of a Random Variable (FRV): Several Views 


There are several different but essentially equivalent views of an FRV. We will now present 
two of them. The differences between them are mainly ones of emphasis. 

Assume as always an underlying probability space $\mathcal{P} = (\Omega, \mathcal{F}, P)$ and a
random variable X defined on it. Recall that X is a rule that assigns to every
$\zeta \in \Omega$ a number $X(\zeta)$. X transforms the $\sigma$-field of events $\mathcal{F}$ into
the Borel $\sigma$-field $\mathcal{B}$ of sets of numbers on the real line. If $R_X$ denotes the
subset of the real line actually reached by X as $\zeta$ roams over $\Omega$, then we can regard
X as an ordinary function with domain $\Omega$ and range $R_X$. Now, additionally,
consider a measurable real function g(x) of the real variable x.

First view ($Y: \Omega \to R_Y$). For every $\zeta \in \Omega$, we generate a number
$g(X(\zeta)) \triangleq Y(\zeta)$. The rule Y, which generates the numbers $\{Y(\zeta)\}$ for random
outcomes $\{\zeta \in \Omega\}$, is an RV with domain $\Omega$ and range $R_Y \subset R^1$. Finally,
for every Borel set of real numbers $B_Y$, the set $\{\zeta: Y(\zeta) \in B_Y\}$ is an event. In
particular the event $\{\zeta: Y(\zeta) \le y\}$ is equal to the event
$\{\zeta: g(X(\zeta)) \le y\}$.

In this view, the stress is on Y as a mapping from $\Omega$ to $R_Y$. The intermediate role of
X is suppressed.


Second view (input/output systems view). For every value of $X(\zeta)$ in the range $R_X$,
we generate a new number Y = g(X) whose range is $R_Y$. The rule Y whose domain is $R_X$
and range is $R_Y$ is a function of the random variable X. In this view the stress is on viewing
Y as a mapping from one set of real numbers to another. A model for this view is to regard
X as the input to a system with transformation function g(·).† For such a system, an input
x gets transformed to an output y = g(x) and an input function X gets transformed to an
output function Y = g(X). (See Figure 3.1-4.)

The input-output viewpoint is the one we stress, partly because it is particularly useful
in dealing with random processes where the input consists of waveforms or sequences of
random variables. The central problem in computations involving FRVs is: Given g(x) and
$F_X(x)$, find the point set $C_y$ such that the following events are equal:

$$ \{\zeta: Y(\zeta) \le y\} = \{\zeta: g[X(\zeta)] \le y\} = \{\zeta: X(\zeta) \in C_y\}. \tag{3.1-6a} $$

†g can be any measurable function; that is, if $R_Y$ is the range of Y, then the inverse image
(see Section 2.2) of every subset in $R_Y$ generated by countable unions and intersections of
sets of the form $\{Y \le y\}$ is an event.

Figure 3.1-4 Input/output view of a function of a random variable.

In general we will economize on notation and write Eq. 3.1-6a as $\{Y \le y\} = \{X \in C_y\}$ in
the sequel. For $C_y$ so determined it follows that

$$ P[Y \le y] = P[X \in C_y] \tag{3.1-6b} $$

since the underlying event is the same. If $C_y$ is empty, then the probability of $\{Y \le y\}$ is
zero.

In dealing with the input-output model, it is generally convenient to omit any references
to an abstract underlying experiment and deal, instead, directly with the RVs X and Y.
In this approach the underlying experiments are the observations on X, events are Borel
subsets of the real line $R^1$, and the set function P[·] is replaced by the distribution function
$F_X(\cdot)$. Then Y is a mapping (an RV) whose domain is the range $R_X$ of X, and whose range
$R_Y$ is a subset of $R^1$. The functional properties of X are ignored in favor of viewing X
as a mechanism that gives rise to numerically valued random phenomena. In this view the
domain of X is irrelevant.

Additional discussion of the various views of an FRV is available in the literature.†


3.2 SOLVING PROBLEMS OF THE TYPE Y = g(X) 


We shall now demonstrate how to solve problems of the type Y = g(X). Eventually we shall 
develop a formula that will enable us to solve problems of this type very rapidly. However, 
use of the formula at too early a stage of the development will tend to mask the underlying 
principles needed to deal with more difficult problems. 


Example 3.2-1 
Let X be a uniform RV on (0,1), that is, X: U(0,1), and let Y = 2X + 3. Then we need to
find the point set $C_y$ in Equation 3.1-6b to compute $F_Y(y)$. Clearly

$$ \{Y \le y\} = \{2X + 3 \le y\} = \{X \le \tfrac12(y - 3)\}. $$

Hence $C_y$ is the interval $(-\infty, \tfrac12(y - 3)]$ and

$$ F_Y(y) = F_X\left(\frac{y - 3}{2}\right). $$

†For example, see Davenport [3-1, p. 174] or Papoulis and Pillai [3-5].

Figure 3.2-1 (a) Original pdf of X; (b) the pdf of Y = 2X + 3.

The pdf of Y is

$$ f_Y(y) = \frac{dF_Y(y)}{dy} = \frac{d}{dy}\left[F_X\left(\frac{y - 3}{2}\right)\right] = \frac12 f_X\left(\frac{y - 3}{2}\right). $$

The solution is shown in Figure 3.2-1.
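A quick Monte Carlo experiment makes the result of this example concrete. The sketch below is an illustration under stated assumptions: the seed, the sample size, and the helper names `empirical_cdf` and `F_Y` are choices made here, not part of the text. It compares the empirical CDF of Y = 2X + 3 with the CDF of X evaluated at (y - 3)/2.

```python
import random

random.seed(0)                      # fixed seed so the run is repeatable
N = 200_000
samples = [2.0 * random.random() + 3.0 for _ in range(N)]   # Y = 2X + 3, X ~ U(0,1)

def empirical_cdf(data, y):
    """Fraction of samples less than or equal to y."""
    return sum(1 for v in data if v <= y) / len(data)

def F_Y(y):
    """F_Y(y) = F_X((y - 3)/2), where F_X is the U(0,1) CDF."""
    x = (y - 3.0) / 2.0
    return min(max(x, 0.0), 1.0)

for y in (3.5, 4.0, 4.8):
    print(y, empirical_cdf(samples, y), F_Y(y))
```

The samples fall in the interval from 3 to 5, and the empirical CDF tracks the straight line (y - 3)/2 there, consistent with a uniform pdf of height 1/2.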





Generalization. Let Y = aX + b with X a continuous RV with pdf $f_X(x)$. Then for a > 0
the outcomes $\{\zeta\} \subset \Omega$ that produce the event $\{aX + b \le y\}$ are identical with the
outcomes $\{\zeta\} \subset \Omega$ that produce the event $\{X \le \frac{y-b}{a}\}$. Thus,

$$ \{Y \le y\} = \{aX + b \le y\} = \left\{X \le \frac{y - b}{a}\right\}. $$

From the definition of the CDF:

$$ F_Y(y) = F_X\left(\frac{y - b}{a}\right) \tag{3.2-1} $$

and so

$$ f_Y(y) = \frac{1}{a} f_X\left(\frac{y - b}{a}\right). \tag{3.2-2} $$

For a < 0, the following events are equal†

$$ \{Y \le y\} = \{aX + b \le y\} = \left\{X \ge \frac{y - b}{a}\right\}. $$

†By which we mean that the event
$\{\zeta: Y(\zeta) \le y\} = \{\zeta: X(\zeta) \ge \frac{y-b}{a}\}$.

Since the events $\{X < \frac{y-b}{a}\}$ and $\{X \ge \frac{y-b}{a}\}$ are disjoint and their union is
the certain event, we obtain from Axiom 3

$$ P\left[X < \frac{y - b}{a}\right] + P\left[X \ge \frac{y - b}{a}\right] = 1. $$

Finally for a continuous RV

$$ P\left[X < \frac{y - b}{a}\right] = P\left[X \le \frac{y - b}{a}\right]. $$

Thus, for a < 0

$$ F_Y(y) = 1 - F_X\left(\frac{y - b}{a}\right) \tag{3.2-3} $$

and

$$ f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y - b}{a}\right), \quad a \ne 0. \tag{3.2-4} $$

When X is not necessarily continuous, we would have to modify the development in the
case a < 0 because it may no longer be true that
$P[X < \frac{y-b}{a}] = P[X \le \frac{y-b}{a}]$, because of the possibility that the event
$\{X = \frac{y-b}{a}\}$ has a positive probability. The modified statement then becomes
$P[X < \frac{y-b}{a}] = P[X \le \frac{y-b}{a}] - P[X = \frac{y-b}{a}] = F_X(\frac{y-b}{a}) - P_X(\frac{y-b}{a})$,
where we have employed the PMF $P_X$ to subtract the probability of this event. The final answer
for the case a < 0 must be changed accordingly.
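The affine rule can be checked numerically for both signs of a. In the sketch below, X is taken to be exponential with unit rate purely as an assumed test case, and the function names are illustrative. The code differentiates the CDF of Y numerically and compares the result with the density obtained by scaling the pdf of X by 1/|a|.

```python
import math

def f_X(x, lam=1.0):
    """Exponential pdf (assumed test distribution)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def F_X(x, lam=1.0):
    """Exponential CDF."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def F_Y(y, a, b):
    """CDF of Y = aX + b for continuous X: Eq. 3.2-1 (a > 0) or Eq. 3.2-3 (a < 0)."""
    x = (y - b) / a
    return F_X(x) if a > 0 else 1.0 - F_X(x)

def f_Y(y, a, b):
    """pdf of Y = aX + b for any a != 0 (Eq. 3.2-4)."""
    return f_X((y - b) / a) / abs(a)

def numeric_pdf(y, a, b, h=1e-6):
    """Central-difference derivative of F_Y at y."""
    return (F_Y(y + h, a, b) - F_Y(y - h, a, b)) / (2.0 * h)

for a, b, y in ((2.0, 1.0, 3.5), (-2.0, 1.0, -1.5)):
    print(a, b, y, numeric_pdf(y, a, b), f_Y(y, a, b))
```

Both test points map back to x = 1.25, so each row should show two nearly equal values; the negative-a case exercises the sign reversal that produces the absolute value in Equation 3.2-4.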


Example 3.2-2 
(square-law detector) Let X be an RV with continuous CDF $F_X(x)$ and let $Y \triangleq X^2$. Then

$$ \{Y \le y\} = \{X^2 \le y\} = \{-\sqrt{y} \le X \le \sqrt{y}\} = \{-\sqrt{y} < X \le \sqrt{y}\} \cup \{X = -\sqrt{y}\}. \tag{3.2-5} $$

The probability of the union of disjoint events is the sum of their probabilities. Using the
definition of the CDF, we obtain

$$ F_Y(y) = F_X(\sqrt{y}) - F_X(-\sqrt{y}) + P[X = -\sqrt{y}]. \tag{3.2-6} $$

If X is continuous, $P[X = -\sqrt{y}] = 0$. Then for y > 0,

$$ f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{1}{2\sqrt{y}} f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(-\sqrt{y}). \tag{3.2-7} $$

For y < 0, $f_Y(y) = 0$. How do we know this? Recall from Equation 3.1-6b that if $C_y$ is
empty, then $P[X \in C_y] = 0$ and hence $f_Y(y) = 0$. For y < 0, there are no values of the RV
X on the real line that satisfy

$$ \{-\sqrt{y} \le X \le \sqrt{y}\}. $$

Hence $f_Y(y) = 0$ for y < 0. If X: N(0, 1), then from Equation 3.2-7,

$$ f_Y(y) = \frac{1}{\sqrt{2\pi y}}\, e^{-\frac12 y}\, u(y), \tag{3.2-8} $$

where u(y) is the standard unit step function. Equation 3.2-8 is the Chi-square pdf with
one degree-of-freedom.
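The square-law result for a standard Normal input can be verified without simulation by differentiating the CDF numerically. This is a minimal sketch: the names `Phi`, `F_Y`, and `f_Y` and the step size are assumptions made here for illustration.

```python
import math

def Phi(x):
    """Standard Normal CDF via the standard error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def F_Y(y):
    """CDF of Y = X**2 for continuous X: F_X(sqrt(y)) - F_X(-sqrt(y)), y > 0."""
    if y <= 0:
        return 0.0
    r = math.sqrt(y)
    return Phi(r) - Phi(-r)

def f_Y(y):
    """Chi-square pdf with one degree of freedom: exp(-y/2) / sqrt(2*pi*y), y > 0."""
    if y <= 0:
        return 0.0
    return math.exp(-y / 2.0) / math.sqrt(2.0 * math.pi * y)

h = 1e-6
checks = {}
for y in (0.5, 1.0, 3.0):
    checks[y] = (F_Y(y + h) - F_Y(y - h)) / (2.0 * h)   # numerical derivative of the CDF
    print(y, checks[y], f_Y(y))
```

Each numerical derivative should match the closed-form density to several decimal places, which is a useful sanity check on the factor of 1/(2 sqrt(y)) that appears when the chain rule is applied.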


Example 3.2-3 
(half-wave rectifier) A half-wave rectifier has the transfer characteristic g(r) = ru(z) 
(Figure 3.2-2). 





g(x) 


x 


Figure 3.2-2 Half-wave rectifier. 


Thus,

$$F_Y(y) = P[X u(X) \le y] = \int_{\{x:\, x u(x) \le y\}} f_X(x)\, dx. \tag{3.2-9}$$

(i) Let $y > 0$; then $\{x: x u(x) \le y\} = \{x: x > 0;\ x \le y\} \cup \{x: x \le 0\} = \{x: x \le y\}$. Thus $F_Y(y) = \int_{-\infty}^{y} f_X(x)\, dx = F_X(y)$.

(ii) Next let $y = 0$. Then $P[Y = 0] = P[X \le 0] = F_X(0)$.

(iii) Finally let $y < 0$. Then $\{x: x u(x) \le y\} = \phi$ (the empty set). Thus,

$$F_Y(y) = \int_{\phi} f_X(x)\, dx = 0.$$


If $X$: $N(0,1)$, then $F_Y(y)$ has the form in Figure 3.2-3. The pdf is obtained by differentiation. Because of the discontinuity at $y = 0$, we obtain a Dirac impulse in the pdf at $y = 0$, that is,

$$f_Y(y) = \begin{cases} 0, & y < 0, \\ F_X(0)\,\delta(y), & y = 0, \\ f_X(y), & y > 0. \end{cases} \tag{3.2-10}$$

Figure 3.2-3 The CDF of $Y$ when $X$: $N(0, 1)$ for the half-wave rectifier.

This can be compactly written as $f_Y(y) = f_X(y)u(y) + F_X(0)\delta(y)$. We note that in this problem it is not true that $P[Y < 0] = P[Y \le 0]$, because there is a nonzero probability that $Y = 0$.
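A short simulation makes the probability mass at $y = 0$ visible. This is a sketch assuming $X$: $N(0,1)$; `rectifier_cdf` and `simulate_rectifier` are hypothetical helper names.

```python
import random
from statistics import NormalDist

def rectifier_cdf(y, cdf_x):
    """CDF of Y = X*u(X): zero for y < 0, F_X(y) for y >= 0 (jump of F_X(0) at 0)."""
    return 0.0 if y < 0 else cdf_x(y)

def simulate_rectifier(n=200_000, seed=3):
    """Draw X: N(0, 1) and pass each sample through g(x) = x*u(x)."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0), 0.0) for _ in range(n)]

samples = simulate_rectifier()
mass_at_zero = sum(1 for y in samples if y == 0.0) / len(samples)
frac_below_one = sum(1 for y in samples if y <= 1.0) / len(samples)
```

The value `mass_at_zero` estimates $P[Y = 0] = F_X(0) = 1/2$, which is the area of the impulse in Equation 3.2-10, and `frac_below_one` estimates $F_Y(1) = F_X(1)$.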





Example 3.2-4 
Let $X$ be a Bernoulli RV with $P[X = 0] = p$ and $P[X = 1] = q$. Then

$$f_X(x) = p\,\delta(x) + q\,\delta(x - 1) \quad \text{and} \quad F_X(x) = p\,u(x) + q\,u(x - 1),$$

where $u(x)$ is the unit-step function of continuous variable $x$. Let $Y \triangleq X - 1$. Then (Figure 3.2-4)

$$F_Y(y) = P[X - 1 \le y] = P[X \le y + 1] = F_X(y + 1) = p\,u(y + 1) + q\,u(y).$$

The pdf is

$$f_Y(y) = \frac{d}{dy}[F_Y(y)] = p\,\delta(y + 1) + q\,\delta(y). \tag{3.2-11}$$





Figure 3.2-4 (a) CDF of $X$; (b) CDF of $Y = X - 1$.
Example 3.2-5
(transformation of CDFs) Let $X$ have a continuous CDF $F_X(x)$ that is a strictly monotone increasing function¹ of $x$. Let $Y$ be an RV formed from $X$ by the transformation with the CDF function itself,

$$Y = F_X(X). \tag{3.2-12}$$

To compute $F_Y(y)$, we proceed as usual:

$$\{Y \le y\} = \{F_X(X) \le y\} = \{X \le F_X^{-1}(y)\}.$$

Hence

$$F_Y(y) = P[F_X(X) \le y] = P[X \le F_X^{-1}(y)] = \int_{\{x:\, F_X(x) \le y\}} f_X(x)\, dx.$$

1. Let $y < 0$. Then since $0 \le F_X(x) \le 1$ for all $x \in [-\infty, \infty]$, the set $\{x: F_X(x) \le y\} = \phi$ and $F_Y(y) = 0$.

2. Let $y > 1$. Then $\{x: F_X(x) \le y\} = [-\infty, \infty]$ and $F_Y(y) = 1$.

3. Let $0 \le y \le 1$. Then $\{x: F_X(x) \le y\} = \{x: x \le F_X^{-1}(y)\}$ and

$$F_Y(y) = \int_{-\infty}^{F_X^{-1}(y)} f_X(x)\, dx = F_X(F_X^{-1}(y)) = y.$$

Hence

$$F_Y(y) = \begin{cases} 0, & y < 0, \\ y, & 0 \le y \le 1, \\ 1, & y > 1. \end{cases} \tag{3.2-13}$$

Equation 3.2-13 says that whatever probability law $X$ obeys, so long as its CDF is continuous and strictly monotonic, $Y \triangleq F_X(X)$ will be uniform on $[0, 1]$. Conversely, given a uniform distribution for $Y$, the transformation $X \triangleq F_X^{-1}(Y)$ will generate an RV with continuous and strictly monotonic CDF $F_X(x)$ (Figure 3.2-5). This technique is sometimes used in simulation to generate RVs with specified distributions from a uniform RV.

Example 3.2-6
(transform uniform to standard Normal) From the last example, we can transform a uniform RV $X$: $U[0, 1]$ to any continuous distribution that has a strictly increasing CDF. If we want a standard Normal, that is, Gaussian $Y$: $N(0,1)$, its CDF is given as


¹In other words, $x_2 > x_1$ implies $F_X(x_2) > F_X(x_1)$, that is, without the possibility of equality.
Figure 3.2-5 Generating an RV with CDF $F_X(x)$ from a uniform RV. (a) Creating a uniform RV $Y$ from an RV $X$ with CDF $F_X(x)$; (b) creating an RV with CDF $F_X(x)$ from a uniform RV $Y$.


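$$F_Y(y) = \begin{cases} \frac{1}{2} + \operatorname{erf}(y), & y \ge 0, \\ \frac{1}{2} - \operatorname{erf}(-y), & y < 0. \end{cases}$$

Solving for the required transformation $g(x)$, we get

$$g(x) = \begin{cases} -\operatorname{erf}^{-1}\left(\frac{1}{2} - x\right), & 0 \le x < \frac{1}{2}, \\ \operatorname{erf}^{-1}\left(x - \frac{1}{2}\right), & \frac{1}{2} \le x \le 1. \end{cases}$$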


A plot of this transformation is given in Figure 3.2-6. 

A MATLAB program, transformCDF.m, available at the book website, can be used to generate relative frequency histograms of this transformation in action. The following results were obtained with 1000 trials. Figure 3.2-7 shows the histogram of the 1000 RVs distributed as $U[0, 1]$. Figure 3.2-8 shows the corresponding histogram of the transformed RVs.
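The same experiment can be sketched in Python. This is not the book's MATLAB program; it is a minimal illustration that assumes the standard library's `NormalDist.inv_cdf` as the inverse standard Normal CDF in place of the two-branch $\operatorname{erf}^{-1}$ expression, and the names `uniform_to_normal` and `sample_normal` are our own.

```python
import random
from statistics import NormalDist, fmean, pstdev

STANDARD_NORMAL = NormalDist(0.0, 1.0)

def uniform_to_normal(u):
    """Map u in (0, 1) to the standard Normal quantile, g(u) = F_Y^{-1}(u)."""
    return STANDARD_NORMAL.inv_cdf(u)

def sample_normal(n=100_000, seed=4):
    """Transform n uniform samples into standard Normal samples."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u = rng.random()          # uniform on [0, 1)
        if u > 0.0:               # inv_cdf is undefined at the endpoints
            out.append(uniform_to_normal(u))
    return out

ys = sample_normal()
sample_mean = fmean(ys)
sample_std = pstdev(ys)
```

The sample mean and standard deviation of `ys` should be close to 0 and 1, and a histogram of `ys` should resemble the bell shape described for the transformed RVs.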


Example 3.2-7 
(quantizing) In analog-to-digital conversion, an analog waveform is sampled, quantized, and coded (Figure 3.2-9). A quantizer is a function that assigns to each sample $x$ a value from a set $Q \triangleq \{y_{-N}, \ldots, y_0, \ldots, y_N\}$ of $2N + 1$ predetermined values [3-2]. Thus, an uncountably







Figure 3.2-6 Plot of transformation $y = g(x) = F_{SN}^{-1}(x)$ that transforms $U[0, 1]$ into $N(0, 1)$.

Figure 3.2-7 Histogram of 1000 i.i.d. RVs distributed as $U[0, 1]$.


infinite set of values (the analog input $x$) is reduced to a finite set (some digital output $y_i$). Note that this practical quantizer is also a limiter, that is, for $x$ greater than some $y_N$ or less than some $y_{-N}$, the output is $y = y_N$ or $y_{-N}$, respectively.

A common quantizer is the uniform quantizer, which is a staircase function of uniform step size $a$, that is,

$$g(x) = ia, \quad (i - 1)a < x \le ia, \quad i \text{ an integer}. \tag{3.2-14}$$

Thus, the quantizer assigns to each $x$ the closest value of $ia$ at or above the continuous sample value $x$, as is shown by the staircase function in Figure 3.2-10.
Figure 3.2-8 Histogram of 1000 transformed i.i.d. RVs.

Figure 3.2-9 An analog-to-digital converter.

Figure 3.2-10 Quantizer output (staircase function) versus input (continuous line).

Figure 3.2-11 $F_X(y)$ versus $F_Y(y)$.


If $X$ is an RV denoting the sampled value of the input and $Y$ denotes the quantizer output, then with $a = 1$, we get the output PMF

$$P_Y(i) = P[Y = i] = P[i - 1 < X \le i] = F_X(i) - F_X(i - 1).$$

The output CDF then becomes the staircase function

$$F_Y(y) = \sum_i P_Y(i)\, u(y - i) = \sum_i [F_X(i) - F_X(i - 1)]\, u(y - i), \tag{3.2-15}$$

as sketched in Figure 3.2-11. When $y = n$ (an integer), $F_Y(n) = F_X(n)$; otherwise $F_Y(y) \le F_X(y)$.
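The staircase CDF can be tabulated directly from Equation 3.2-15. The sketch below assumes $a = 1$ and $X$: $N(0,1)$, and truncates the sum at a lower index `i_min` where $F_X$ is numerically zero; the function names and the truncation are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def quantize(x):
    """Uniform quantizer of Equation 3.2-14 with a = 1: the i with i-1 < x <= i."""
    return math.ceil(x)

def quantizer_pmf(i, cdf_x):
    """P_Y(i) = F_X(i) - F_X(i - 1)."""
    return cdf_x(i) - cdf_x(i - 1)

def quantizer_cdf(y, cdf_x, i_min=-40):
    """Equation 3.2-15, with the sum truncated below at i_min (an assumption)."""
    return sum(quantizer_pmf(i, cdf_x) for i in range(i_min, math.floor(y) + 1))

phi = NormalDist().cdf
at_integer = quantizer_cdf(2, phi)        # should equal F_X(2)
between = quantizer_cdf(1.5, phi)         # equals F_X(1), below F_X(1.5)
```

The sum telescopes, which is why `at_integer` reproduces $F_X(2)$ while `between` stays at the previous step.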


Example 3.2-8 
(sine wave) A classic problem is to determine the pdf of $Y = \sin X$, where $X$: $U(-\pi, +\pi)$, that is, uniformly distributed over $(-\pi, +\pi)$. From Figure 3.2-12 we see that for $0 \le y < 1$, the event $\{Y \le y\}$ satisfies

$$\{Y \le y\} = \{\sin X \le y\}$$
$$= \{-\pi < X \le \sin^{-1} y\} \cup \{\pi - \sin^{-1} y < X \le \pi\}.$$

Since the two events on the last line are disjoint, we obtain

$$F_Y(y) = F_X(\pi) - F_X(\pi - \sin^{-1} y) + F_X(\sin^{-1} y) - F_X(-\pi). \tag{3.2-16}$$
Hence

$$f_Y(y) = \frac{dF_Y(y)}{dy} = f_X(\pi - \sin^{-1} y)\frac{1}{\sqrt{1 - y^2}} + f_X(\sin^{-1} y)\frac{1}{\sqrt{1 - y^2}} \tag{3.2-17}$$

$$= \frac{1}{2\pi}\frac{1}{\sqrt{1 - y^2}} + \frac{1}{2\pi}\frac{1}{\sqrt{1 - y^2}} \tag{3.2-18}$$

$$= \frac{1}{\pi}\frac{1}{\sqrt{1 - y^2}}, \quad 0 \le y < 1. \tag{3.2-19}$$


If this calculation is repeated for $-1 < y \le 0$, the diagram in Figure 3.2-12 changes to that of Figure 3.2-13. So the event $\{Y \le y\} = \{\sin X \le y\}$ expressed in terms of the RV $X$ becomes

$$\{-\pi - \sin^{-1} y < X \le \sin^{-1} y\}, \tag{3.2-20}$$

where we are now using the inverse sine appropriate for $y < 0$. Then we can write the following equation for the CDFs:

$$F_Y(y) = F_X(\sin^{-1} y) - F_X(-\pi - \sin^{-1} y).$$


Figure 3.2-12 Graph showing roots of $y = \sin x$ when $0 \le y < 1$.

Figure 3.2-13 Plot showing roots of $y = \sin x$ when $-1 < y < 0$.

Figure 3.2-14 The probability density function of $Y = \sin X$.


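Upon differentiation, we obtain

$$f_Y(y) = \frac{dF_Y(y)}{dy} = f_X(\sin^{-1} y)\frac{1}{\sqrt{1 - y^2}} - f_X(-\pi - \sin^{-1} y)\frac{-1}{\sqrt{1 - y^2}}$$

$$= \frac{1}{2\pi}\frac{1}{\sqrt{1 - y^2}} + \frac{1}{2\pi}\frac{1}{\sqrt{1 - y^2}}$$

$$= \frac{1}{\pi}\frac{1}{\sqrt{1 - y^2}}, \quad -1 < y \le 0,$$

which is the same form as before when $0 \le y < 1$.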
Finally we consider $|y| > 1$, and since $|\sin(x)| \le 1$ for all $x$, we see that the pdf $f_Y$ must be zero there. Combining these three results, we obtain the complete solution (Figure 3.2-14):

$$f_Y(y) = \begin{cases} \dfrac{1}{\pi}\dfrac{1}{\sqrt{1 - y^2}}, & |y| < 1, \\ 0, & \text{otherwise}. \end{cases} \tag{3.2-21}$$
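Integrating Equation 3.2-21 gives the closed-form CDF $F_Y(y) = \frac{1}{2} + \frac{1}{\pi}\sin^{-1} y$ for $|y| < 1$, which also follows from Equation 3.2-16 with the uniform CDF. The following sketch compares that expression with a simulation; the function names are illustrative.

```python
import math
import random

def sine_pdf(y):
    """Equation 3.2-21: pdf of Y = sin X for X uniform on (-pi, pi)."""
    return 1.0 / (math.pi * math.sqrt(1.0 - y * y)) if abs(y) < 1 else 0.0

def sine_cdf(y):
    """CDF obtained by integrating the pdf above."""
    if y <= -1:
        return 0.0
    if y >= 1:
        return 1.0
    return 0.5 + math.asin(y) / math.pi

def empirical_sine_cdf(y, n=200_000, seed=5):
    """Relative frequency of {sin X <= y}."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if math.sin(rng.uniform(-math.pi, math.pi)) <= y)
    return hits / n
```

For example, `sine_cdf(0.5)` is $2/3$ and `sine_cdf(-0.5)` is $1/3$, and the relative frequencies should land near those values.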








We shall now go on to derive a simple formula that will enable us to solve many problems 
of the type Y = g(X) by going directly from pdf to pdf, without the need to find the CDF 
first. We shall call this new approach the direct method. For some problems, however, the 
indirect method of this past section may be less prone to error. 


General Formula of Determining the pdf of Y = g(X) 


We are given the continuous RV $X$ with pdf $f_X(x)$ and the differentiable function $g(x)$ of the real variable $x$. What is the pdf of $Y \triangleq g(X)$?
Solution The event $\{y < Y \le y + dy\}$ can be written as a union of disjoint elementary events $\{E_i\}$ in the Borel field generated under $X$. If the equation $y = g(x)$ has a finite number $n$ of real roots† $x_1, \ldots, x_n$, then the disjoint events have the form $E_i = \{x_i - |dx_i| < X \le x_i\}$ if $g'(x_i)$ is negative or $E_i = \{x_i < X \le x_i + |dx_i|\}$ if $g'(x_i)$ is positive.‡ (See Figure 3.2-15.) In either case, it follows from the definition of the pdf that $P[E_i] = f_X(x_i)|dx_i|$. Hence

$$P[y < Y \le y + dy] = f_Y(y)|dy| = \sum_{i=1}^{n} f_X(x_i)|dx_i| \tag{3.2-22}$$

or, equivalently, if we divide through by $|dy|$,

$$f_Y(y) = \sum_{i=1}^{n} f_X(x_i)\left|\frac{dx_i}{dy}\right| = \sum_{i=1}^{n} f_X(x_i)\left|\frac{dy}{dx_i}\right|^{-1}.$$

At the roots of $y = g(x)$, $dy/dx_i = g'(x_i)$, and we obtain the important formula

$$f_Y(y) = \sum_{i=1}^{n} f_X(x_i)/|g'(x_i)|, \quad x_i = x_i(y), \quad g'(x_i) \ne 0. \tag{3.2-23}$$


i=1 


†By roots we mean the set of points $x_i$ such that $y - g(x_i) = 0$, $i = 1, \ldots, n$.
‡The prime indicates derivatives with respect to $x$.

Equation 3.2-23 is a fundamental equation that is very useful in solving problems where the transformation $g(x)$ has several roots. Note that we need to make the assumption that $g'(x_i) \ne 0$ at all the roots. To see what happens otherwise, realize that a region where $g' = 0$ is a flat region for the transformation $g$. So for any $x$ in this flat region, the $y$ value is identical, and that will create a probability mass at this value of $y$ whose amount is equal to the probability of the event that $X$ falls in this flat region. In terms of the pdf $f_Y$, the mass would turn into an impulse with area equal to the mass.

Figure 3.2-15 The event $\{y < Y \le y + dy\}$ is the union of two disjoint events on the probability space of $X$.

If, for a given $y$, the equation $y - g(x) = 0$ has no real roots, then $f_Y = 0$ at that $y$.† Figure 3.2-15 illustrates the case when $n = 2$.
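Equation 3.2-23 translates almost line for line into code. The sketch below is one possible arrangement, not a library routine: the caller supplies a function returning the real roots of $y = g(x)$ and the derivative $g'$, and the helper names are hypothetical. It is applied to $g(x) = x^2$ with $X$: $N(0,1)$, for which the Chi-square pdf with one degree of freedom should result.

```python
import math
from statistics import NormalDist

def pdf_direct(y, roots, dg, pdf_x):
    """Equation 3.2-23: sum of f_X(x_i)/|g'(x_i)| over the real roots of y = g(x)."""
    total = 0.0
    for x in roots(y):
        slope = dg(x)
        if slope == 0:
            raise ValueError("g'(x) = 0 at a root; Equation 3.2-23 does not apply")
        total += pdf_x(x) / abs(slope)
    return total  # an empty root list gives f_Y(y) = 0

def square_roots(y):
    """Real roots of y = x**2 (none when y < 0)."""
    if y < 0:
        return []
    r = math.sqrt(y)
    return [r, -r]

def square_slope(x):
    """Derivative of g(x) = x**2."""
    return 2.0 * x

f_at_one = pdf_direct(1.0, square_roots, square_slope, NormalDist().pdf)
chi2_at_one = math.exp(-0.5) / math.sqrt(2.0 * math.pi)
```

Passing the roots as a function keeps the formula itself generic; only `square_roots` and `square_slope` change from one transformation to the next.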


Example 3.2-9
(trig function of X) To illustrate the use of Equation 3.2-23, we solve Example 3.2-8 by using this formula. Thus we seek the pdf of $Y = \sin X$ when the pdf of $X$ is $f_X(x) = 1/(2\pi)$ for $-\pi < x \le \pi$. Here the function $g$ is $g(x) = \sin x$. The two roots of $y - g(x) = y - \sin x = 0$ for $y > 0$ are $x_1 = \sin^{-1} y$, $x_2 = \pi - \sin^{-1} y$. Also

$$\frac{dg}{dx} = \cos x,$$

which must be evaluated at the two roots $x_1$ and $x_2$. At $x_1 = \sin^{-1} y$ we get $dg/dx|_{x = x_1} = \cos(\sin^{-1} y)$. Likewise when $x_2 = \pi - \sin^{-1} y$, we get

$$\left.\frac{dg}{dx}\right|_{x = x_2} = \cos(\pi - \sin^{-1} y) = \cos\pi\,\cos(\sin^{-1} y) + \sin\pi\,\sin(\sin^{-1} y) = -\cos(\sin^{-1} y).$$

The quantity $\cos(\sin^{-1} y)$ can be further evaluated with the help of Figure 3.2-16. There we see that $\theta = \sin^{-1} y$ and $\cos\theta = \sqrt{1 - y^2} = \cos(\sin^{-1} y)$. Hence

$$\left|\frac{dg}{dx}\right|_{x_1} = \left|\frac{dg}{dx}\right|_{x_2} = \sqrt{1 - y^2}.$$

Figure 3.2-16 Evaluating $\cos(\sin^{-1} y)$.

Finally, $f_X(\sin^{-1} y) = f_X(\pi - \sin^{-1} y) = 1/(2\pi)$. Using these results in Equation 3.2-23 enables us to write

$$f_Y(y) = \frac{1}{\pi}\frac{1}{\sqrt{1 - y^2}}, \quad 0 \le y < 1,$$

which is the same result as in Equation 3.2-19. Repeating this procedure for $y < 0$ then gives the same solution for all $y$ as is given in Equation 3.2-21.


†The RV $X$, being real valued, cannot take on values that are imaginary.

Figure 3.2-17 Roots of $g(x) = x^n - y = 0$ when $n$ is odd.


Example 3.2-10
(nonlinear devices) A number of nonlinear zero-memory devices can be modeled by a transmission function $g(x) = x^n$. Let $Y = X^n$. The pdf of $Y$ depends on whether $n$ is even or odd. We solve the case of $n$ odd, leaving $n$ even as an exercise. For $n$ odd and $y > 0$, the only real root to $y - x^n = 0$ is $x_n = y^{1/n}$. Also

$$\frac{dg}{dx} = n x_n^{n-1} = n y^{(n-1)/n}.$$

For $y < 0$, the only real root is $x_n = -|y|^{1/n}$. See Figure 3.2-17. Also

$$\frac{dg}{dx} = n x_n^{n-1} = n |y|^{(n-1)/n}.$$

Hence

$$f_Y(y) = \begin{cases} \dfrac{1}{n}\, y^{(1-n)/n}\, f_X(y^{1/n}), & y > 0, \\ \dfrac{1}{n}\, |y|^{(1-n)/n}\, f_X(-|y|^{1/n}), & y < 0. \end{cases}$$

In problems in which $g(x)$ assumes a constant value, say $g(x) = c$, over some nonzero-width interval, Equation 3.2-23 cannot be used to compute $f_Y(y)$ because $g'(x) = 0$ over the interval. One additionally has to find the probability mass generated by this flat section.


Example 3.2-11
(linear amplifier with cutoff) Consider a nonlinear device with transformation as shown in Figure 3.2-18.

Figure 3.2-18 A linear amplifier with cutoff.

The function $g(x)$ is given by

$$g(x) = 0, \quad |x| \ge 1 \tag{3.2-24}$$
$$g(x) = x, \quad -1 < x < 1. \tag{3.2-25}$$

Thus $g'(x) = 0$ for $|x| > 1$, and $g'(x) = 1$ for $-1 < x < 1$. For $y \ge 1$ and $y \le -1$, there are no real roots to $y - g(x) = 0$. Hence $f_Y(y) = 0$ in this range. For $-1 < y < 1$, the only root to $y - g(x) = y - x = 0$ is $x = y$. Hence in this range Equation 3.2-23 applies with $|g'(x)| = 1$ and so $f_Y(y) = f_X(y)$. We note that $P[Y = 0] = P[X \ge 1] + P[X \le -1]$. If $X$: $N(0,1)$, $P[X \ge 1] = 1/2 - \operatorname{erf}(1) = P[X \le -1]$ and so $P[Y = 0] = 1 - 2\operatorname{erf}(1) = 0.317$. We would like to incorporate the result that $P[Y = 0] = 0.317$ into the pdf of $Y$. We can do this with the aid of delta functions, realizing that

$$P[Y = 0] = 0.317 = \lim_{\epsilon \to 0} \int_{0 - \epsilon}^{0 + \epsilon} 0.317\,\delta(y)\, dy.$$

Hence by including the term $0.317\,\delta(y)$ in $f_Y(y)$ we obtain the complete solution as:

$$f_Y(y) = \begin{cases} 0, & |y| \ge 1, \\ (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2} y^2\right) + 0.317\,\delta(y), & -1 < y < 1. \end{cases}$$
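The value 0.317 is easy to reproduce. The expressions above imply the convention $\operatorname{erf}(x) = F_X(x) - \frac{1}{2}$ for $X$: $N(0,1)$ (so that $\operatorname{erf}(\infty) = \frac{1}{2}$), which differs from the error function in most software libraries. The sketch below therefore builds it from the Normal CDF; the name `book_erf` is our own.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard Normal CDF F_X

def book_erf(x):
    """erf in the convention used here: area under the N(0,1) pdf from 0 to x."""
    return PHI(x) - 0.5

# Probability mass at y = 0 created by the two flat sections |x| >= 1.
p_zero = 1.0 - 2.0 * book_erf(1.0)

# Probability carried by the linear section -1 < x < 1.
p_linear = PHI(1.0) - PHI(-1.0)
```

The impulse area `p_zero` and the continuous part `p_linear` add to one, as a valid pdf requires.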


Example 3.2-12 
(infinite roots) Here we consider the periodic extension of the transformation shown in Figure 3.2-18. The extended $g(x)$ is shown in Figure 3.2-19.

Figure 3.2-19 Periodic transformation function.

The function in this case is described by

$$g(x) = \sum_{n=-\infty}^{\infty} (x - 2n)\, \operatorname{rect}\left(\frac{x - 2n}{2}\right).$$

Here rect is the symmetric unit-pulse function defined as

$$\operatorname{rect}(x) \triangleq \begin{cases} 1, & -0.5 < x < +0.5, \\ 0, & \text{else}. \end{cases}$$

As in the previous example $f_Y(y) = 0$ for $|y| \ge 1$ because there are no real roots to the equation $y - g(x) = 0$ in this range. On the other hand, when $-1 < y < 1$, there are an infinite number of roots to $y - g(x) = 0$ and these are given by $x_n = y + 2n$ for $n = \ldots, -2, -1, 0, 1, 2, \ldots$. At each root $|g'(x_n)| = 1$. Hence, from Equation 3.2-23 we obtain $f_Y(y) = \sum_{n=-\infty}^{\infty} f_X(y + 2n)\, \operatorname{rect}\left(\frac{y}{2}\right)$. In the case that $X$: $N(0,1)$ this specializes to

$$f_Y(y) = (2\pi)^{-1/2} \sum_{n=-\infty}^{\infty} \exp\left(-\tfrac{1}{2}(y + 2n)^2\right) \times \operatorname{rect}\left(\frac{y}{2}\right).$$

While this result is correct, it seems hard to believe that the sum of infinitely many positive terms yields a function whose area is restricted to one. To show that $f_Y(y)$ does indeed integrate to one, we proceed as follows:

$$\int_{-\infty}^{\infty} f_Y(y)\, dy = \sum_{n=-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \int_{-1}^{1} \exp\left(-\tfrac{1}{2}(y + 2n)^2\right) dy \tag{3.2-26}$$

$$= \sum_{n=-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \int_{2n-1}^{2n+1} \exp\left(-\tfrac{1}{2} y^2\right) dy \tag{3.2-27}$$

$$= \sum_{n=-\infty}^{\infty} [\operatorname{erf}(1 + 2n) - \operatorname{erf}(-1 + 2n)]. \tag{3.2-28}$$

If this last sum is written out, the reader will quickly find that all the terms cancel except the first ($n = -\infty$) and the last ($n = \infty$). This leaves

$$\int_{-\infty}^{\infty} f_Y(y)\, dy = \operatorname{erf}(\infty) - \operatorname{erf}(-\infty) = 2 \times \operatorname{erf}(\infty) = 1.$$
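Both the area of $f_Y$ and the telescoping sum can be checked numerically. The sketch below assumes $X$: $N(0,1)$, truncates the infinite sum to $|n| \le 20$ (the omitted terms are far below machine precision), and integrates with a simple midpoint rule; all names and the truncation are illustrative choices.

```python
import math
from statistics import NormalDist

def wrapped_pdf(y, n_terms=20):
    """f_Y(y) for the periodic transformation, summing the roots y + 2n, |n| <= n_terms."""
    if not -1.0 < y < 1.0:
        return 0.0
    s = sum(math.exp(-0.5 * (y + 2 * n) ** 2) for n in range(-n_terms, n_terms + 1))
    return s / math.sqrt(2.0 * math.pi)

def area(steps=2000, n_terms=20):
    """Midpoint-rule integral of wrapped_pdf over (-1, 1)."""
    h = 2.0 / steps
    return sum(wrapped_pdf(-1.0 + (k + 0.5) * h, n_terms) for k in range(steps)) * h

def telescoped_sum(n_terms=20):
    """Truncated version of Equation 3.2-28, written with the Normal CDF."""
    phi = NormalDist().cdf
    return sum(phi(2 * n + 1) - phi(2 * n - 1) for n in range(-n_terms, n_terms + 1))

total_area = area()
total_sum = telescoped_sum()
```

The differences of CDF values in `telescoped_sum` equal the differences of erf values in Equation 3.2-28, since the two functions differ only by the constant $\frac{1}{2}$.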


3.3 SOLVING PROBLEMS OF THE TYPE Z = g(X, Y) 


In many problems in science and engineering, a random variable Z is functionally related 
to two (or more) random variables X, Y. Some examples are 


1. The signal Z at the input of an amplifier consists of a signal X to which is added 
independent random noise Y. Thus Z = X +Y. If X is also an RV, what is the 
pdf of Z? (See Figure 3.3-1.) 
[Figure 3.3-1: diagram labels "Signal X" and "Noise Y"]

Figure 3.3-2 Displacement in the random-walk problem.


2. A two-engine airplane is capable of flight as long as at least one of its two engines is working. If the time-to-failures of the starboard and port engines are $X$ and $Y$, respectively, then the time-to-crash of the airplane is $Z \triangleq \max(X, Y)$. What is the pdf of $Z$?

3. Many signal processing systems multiply two signals together (modulators, demodulators, correlators, and so forth). If $X$ is the signal on one input and $Y$ is the signal on the other input, what is the pdf of the output $Z \triangleq XY$?

4. In the famous “random-walk” problem that applies to a number of important physical problems, a particle undergoes random independent displacements $X$ and $Y$ in the $x$ and $y$ directions, respectively. What is the pdf of the total displacement $Z \triangleq [X^2 + Y^2]^{1/2}$? (See Figure 3.3-2.)

Problems of the type $Z = g(X, Y)$ are not fundamentally different from the type of problem we discussed in Section 3.2. Recall that for $Y = g(X)$ the basic problem was to find the point set $C_y$ such that the events $\{\zeta: Y(\zeta) \le y\}$ and $\{\zeta: X(\zeta) \in C_y\}$ were equal. Essentially, the same problem occurs here as well: Find the point set $C_z$ in the $(x, y)$ plane such that the events $\{\zeta: Z(\zeta) \le z\}$ and $\{\zeta: (X(\zeta), Y(\zeta)) \in C_z\}$ are equal, this being indicated in our usual shorthand notation by

$$\{Z \le z\} = \{(X, Y) \in C_z\} \tag{3.3-1}$$

and

$$F_Z(z) = \iint_{(x,y) \in C_z} f_{XY}(x, y)\, dx\, dy. \tag{3.3-2}$$
The point set $C_z$ is determined from the functional relation $g(x, y) \le z$. Clearly in problems of the type $Z = g(X, Y)$ we deal with joint densities or distributions and double integrals (or summations) instead of single ones. Thus, in general, the computation of $f_Z(z)$ is more complicated than the computation of $f_Y(y)$ in $Y = g(X)$. However, we have access to two great labor-saving devices, which we shall learn about later: (1) We can solve many $Z = g(X, Y)$-type problems by a “turn-the-crank” type formula, essentially an extension of Equation 3.2-23, through the use of auxiliary variables (Section 3.4); and (2) we can solve problems of the type $Z = X + Y$ through the use of characteristic functions (Chapter 4). However, use of these shortcut methods at this stage would obscure the underlying principles.
Let us now solve the problems mentioned earlier from first principles.


Example 3.3-1
(product of RVs) To find $C_z$ in Equation 3.3-2 for the CDF of $Z = XY$, we need to determine the region where $g(x, y) \triangleq xy \le z$. This region is shown in Figure 3.3-3 for $z > 0$.

Figure 3.3-3 The region $xy \le z$ for $z > 0$.

Thus, reasoning from the diagram, we compute

$$F_Z(z) = \int_{0}^{\infty} \left( \int_{-\infty}^{z/y} f_{XY}(x, y)\, dx \right) dy + \int_{-\infty}^{0} \left( \int_{z/y}^{\infty} f_{XY}(x, y)\, dx \right) dy \quad \text{for } z \ge 0. \tag{3.3-3}$$

To compute the density $f_Z$, it is necessary to differentiate this expression with respect to $z$. We can do this directly on Equation 3.3-3; however, to see this more clearly we first define the indefinite integral $G_{XY}(x, y)$ by

$$G_{XY}(x, y) \triangleq \int f_{XY}(x, y)\, dx. \tag{3.3-4}$$

Then

$$F_Z(z) = \int_{0}^{\infty} [G_{XY}(z/y, y) - G_{XY}(-\infty, y)]\, dy + \int_{-\infty}^{0} [G_{XY}(\infty, y) - G_{XY}(z/y, y)]\, dy$$

and differentiation with respect to $z$ is now fairly simple, giving

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \int_{-\infty}^{\infty} \frac{1}{|y|}\, f_{XY}(z/y, y)\, dy. \tag{3.3-5}$$

We could have gotten the same answer by directly differentiating Equation 3.3-3 with respect to $z$ using formula A2-1 of Appendix A.

The question remains as to what is the answer when $z < 0$. It turns out that Equation 3.3-5 is valid for $z < 0$ as well, so that it is valid for all $z$. The corresponding sketch in the case when $z < 0$ is shown in Figure 3.3-4. From this figure, performing the integration over the new shaded region corresponding to $\{xy \le z\}$, now in the case $z < 0$, you should get the same integral expression for $F_Z(z)$ as above, that is, Equation 3.3-3. Taking the derivative with respect to $z$ and moving it inside the integral over $y$, we then again obtain Equation 3.3-5. Thus, the general pdf for the product of two random variables for any value of $z$ is confirmed to be

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{|y|}\, f_{XY}(z/y, y)\, dy, \quad -\infty < z < +\infty. \tag{3.3-6}$$

Figure 3.3-4 The region $xy \le z$ for $z < 0$.

As a special case, assume $X$ and $Y$ are independent, identically distributed (i.i.d.) RVs with

$$f_X(x) = f_Y(x) = \frac{\alpha/\pi}{\alpha^2 + x^2}.$$

This is known as the Cauchy† probability law. Because of independence

$$f_{XY}(x, y) = f_X(x) f_Y(y)$$

and because of the evenness of the integrand in Equation 3.3-6, we obtain,‡ after a change of variable,

$$f_Z(z) = \left(\frac{\alpha}{\pi}\right)^2 \int_{0}^{\infty} \frac{1}{z^2 + \alpha^2 x} \cdot \frac{1}{\alpha^2 + x}\, dx = \left(\frac{\alpha}{\pi}\right)^2 \frac{1}{z^2 - \alpha^4} \ln \frac{z^2}{\alpha^4}. \tag{3.3-7}$$

Figure 3.3-5 The pdf $f_Z(z)$ of $Z = XY$ when $X$ and $Y$ are i.i.d. RVs and Cauchy.

See Figure 3.3-5 for a sketch of $f_Z(z)$ for $\alpha = 1$.
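The change of variable and the integral behind Equation 3.3-7 can be filled in as follows. This is a sketch of one route to the result, assuming the Cauchy pdf $f_X(x) = (\alpha/\pi)/(\alpha^2 + x^2)$ with scale parameter $\alpha$; the cited table of integrals may organize the steps differently.

```latex
\begin{align*}
f_Z(z) &= \left(\frac{\alpha}{\pi}\right)^2 \int_{-\infty}^{\infty}
          \frac{1}{|y|}\,\frac{1}{\alpha^2 + z^2/y^2}\,\frac{1}{\alpha^2 + y^2}\,dy
        = 2\left(\frac{\alpha}{\pi}\right)^2 \int_{0}^{\infty}
          \frac{y}{(\alpha^2 y^2 + z^2)(\alpha^2 + y^2)}\,dy \\
       &= \left(\frac{\alpha}{\pi}\right)^2 \int_{0}^{\infty}
          \frac{dx}{(z^2 + \alpha^2 x)(\alpha^2 + x)}
          \qquad (x = y^2,\ dx = 2y\,dy) \\
       &= \left(\frac{\alpha}{\pi}\right)^2 \frac{1}{z^2 - \alpha^4}
          \int_{0}^{\infty} \left[\frac{1}{\alpha^2 + x}
          - \frac{\alpha^2}{z^2 + \alpha^2 x}\right] dx \\
       &= \left(\frac{\alpha}{\pi}\right)^2 \frac{1}{z^2 - \alpha^4}
          \left[\ln\frac{\alpha^2 + x}{z^2 + \alpha^2 x}\right]_{0}^{\infty}
        = \left(\frac{\alpha}{\pi}\right)^2 \frac{1}{z^2 - \alpha^4}
          \left[\ln\frac{1}{\alpha^2} - \ln\frac{\alpha^2}{z^2}\right]
        = \left(\frac{\alpha}{\pi}\right)^2 \frac{1}{z^2 - \alpha^4}
          \ln\frac{z^2}{\alpha^4}.
\end{align*}
```

The partial-fraction step requires $z^2 \ne \alpha^4$. At $z^2 = \alpha^4$ the expression has a removable singularity, and the limit is $1/(\pi^2\alpha^2)$.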


Example 3.3-2 
(maximum operation) We wish to compute the pdf of $Z = \max(X, Y)$ if $X$ and $Y$ are independent RVs. Then

$$F_Z(z) = P[\max(X, Y) \le z].$$

But the event $\{\max(X, Y) \le z\}$ is equal to $\{X \le z, Y \le z\}$. Hence

$$P[Z \le z] = P[X \le z, Y \le z] = F_X(z) F_Y(z) \tag{3.3-8}$$

and by differentiation, we get

$$f_Z(z) = f_Y(z) F_X(z) + f_X(z) F_Y(z). \tag{3.3-9}$$

Again as a special case, let $f_X(x) = f_Y(x)$ be the uniform $[0, 1]$ pdf. Then

$$f_Z(z) = 2z[u(z) - u(z - 1)], \tag{3.3-10}$$

which is plotted in Figure 3.3-6. The computation of $Z = \min(X, Y)$ is left as an end-of-chapter problem.


†Augustin-Louis Cauchy (1789-1857). French mathematician who wrote copiously on astronomy, optics, hydrodynamics, function theory, and the like.

‡See B. O. Peirce and R. M. Foster, A Short Table of Integrals, 4th ed. (Boston, MA: Ginn & Company, 1956), p. 8.

Figure 3.3-6 The pdf of $Z = \max(X, Y)$ for $X$, $Y$ i.i.d. and uniform in $[0, 1]$.

Figure 3.3-7 The pdf of the maximum of two independent exponential random variables.


Example 3.3-3
(max of exponentials) Let $X$, $Y$ be i.i.d. RVs with exponential pdf $f_X(x) = e^{-x}u(x)$. Let $Z = \max(X, Y)$. Compute $f_Z(z)$ and then determine the probability $P[Z \le 1]$.

Solution From $P[Z \le z] = P[X \le z, Y \le z] = P[X \le z]P[Y \le z]$, we obtain

$$F_Z(z) = F_X(z) F_Y(z) = (1 - e^{-z})^2 u(z)$$

and

$$f_Z(z) = \frac{dF_Z(z)}{dz} = 2e^{-z}(1 - e^{-z})u(z).$$

The pdf is shown in Figure 3.3-7. Finally, $F_Z(1) = (1 - e^{-1})^2 u(1) \approx 0.4$.
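The value $P[Z \le 1] \approx 0.4$ can be confirmed by simulation. The following sketch assumes unit-rate exponentials, as in the example; the function names are illustrative.

```python
import math
import random

def cdf_max_exp(z):
    """F_Z(z) = (1 - exp(-z))**2 for z > 0, from Equation 3.3-8."""
    return (1.0 - math.exp(-z)) ** 2 if z > 0 else 0.0

def pdf_max_exp(z):
    """f_Z(z) = 2 exp(-z) (1 - exp(-z)) for z > 0."""
    return 2.0 * math.exp(-z) * (1.0 - math.exp(-z)) if z > 0 else 0.0

def empirical_cdf_max(z, n=200_000, seed=6):
    """Relative frequency of {max(X, Y) <= z} for i.i.d. unit-rate exponentials."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if max(rng.expovariate(1.0), rng.expovariate(1.0)) <= z)
    return hits / n

exact = cdf_max_exp(1.0)
estimate = empirical_cdf_max(1.0)
```

To four decimal places `exact` is 0.3996, which the text rounds to 0.4.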
The sum of two independent random variables. The situation modeled by $Z = X + Y$ (and its extension $Z = \sum_{i=1}^{N} X_i$) occurs so frequently in engineering and science that the computation of $f_Z(z)$ is perhaps the most important of all problems of the type $Z = g(X, Y)$.

As in other problems of this type, we must find the set of points $C_z$ such that the event $\{Z \le z\}$ that, by definition, is equal to the event $\{X + Y \le z\}$ is also equal to $\{(X, Y) \in C_z\}$. The set of points $C_z$ is the set of all points such that $g(x, y) \triangleq x + y \le z$ and therefore represents the shaded region to the left of the line in Figure 3.3-8; any point in the shaded region satisfies $x + y \le z$.

Using Equation 3.3-2, specialized for this case, we obtain

$$F_Z(z) = \int_{-\infty}^{\infty} \left( \int_{-\infty}^{z-y} f_{XY}(x, y)\, dx \right) dy = \int_{-\infty}^{\infty} [G_{XY}(z - y, y) - G_{XY}(-\infty, y)]\, dy, \tag{3.3-11}$$

where $G_{XY}(x, y)$ is the indefinite integral

$$G_{XY}(x, y) \triangleq \int f_{XY}(x, y)\, dx. \tag{3.3-12}$$


Figure 3.3-8 The region $C_z$ (shaded) for computing the pdf of $Z \triangleq X + Y$.

The pdf is obtained by differentiation of $F_Z(z)$. Thus,

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \int_{-\infty}^{\infty} \frac{d}{dz}[G_{XY}(z - y, y)]\, dy = \int_{-\infty}^{\infty} f_{XY}(z - y, y)\, dy. \tag{3.3-13}$$


Equation 3.3-13 is an important result (compare with Equation 3.3-6 for $Z = XY$). In many instances $X$ and $Y$ are independent RVs so that $f_{XY}(x, y) = f_X(x) f_Y(y)$. Then Equation 3.3-13 takes the special form

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y) f_Y(y)\, dy, \tag{3.3-14}$$

which is known as the convolution integral or, more specifically, the convolution of $f_X$ with $f_Y$.† It is a simple matter to prove that Equation 3.3-14 can be rewritten as

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x)\, dx, \tag{3.3-15}$$

by use of the transformation of variables $x = z - y$ in Equation 3.3-14.


Example 3.3-4
(addition of RVs) Let $X$ and $Y$ be independent RVs with $f_X(x) = e^{-x}u(x)$ and $f_Y(y) = \frac{1}{2}[u(y + 1) - u(y - 1)]$ and let $Z \triangleq X + Y$. What is the pdf of $Z$?

Solution A big help in solving convolution-type problems is to keep track of what is going on graphically. Thus, Figure 3.3-9(a) shows $f_X(y)$ and $f_Y(y)$; Figure 3.3-9(b) shows $f_X(z - y)$. Note that $f_X(z - y)$ is the reversed and shifted image of $f_X(y)$. How do we know that the point at the leading edge of the reverse/shifted image is $y = z$? Consider

$$f_X(z - y) = e^{-(z - y)} u(z - y).$$

But $u(z - y) = 0$ for $y > z$. Therefore the reverse/shifted function is nonzero for $(-\infty, z]$ and the leading edge of $f_X(z - y)$ is at $y = z$.

Since $f_X$ and $f_Y$ are discontinuous functions, we do not expect $f_Z(z)$ to be described by the same expression for all values of $z$. This means that we must do a careful step-by-step evaluation of Equation 3.3-14 for different regions of $z$-values.

(a) Region 1. $z < -1$. For $z < -1$ the situation is as shown in Figure 3.3-10(a). Since there is no overlap, Equation 3.3-14 yields zero. Thus $f_Z(z) = 0$ for $z < -1$.

(b) Region 2. $-1 \le z < 1$. In this region the situation is as in Figure 3.3-10(b). Thus Equation 3.3-14 yields

$$f_Z(z) = \frac{1}{2} \int_{-1}^{z} e^{-(z - y)}\, dy = \frac{1}{2}\left[1 - e^{-(z + 1)}\right].$$

†A common notation for the convolution integral as in Equation 3.3-15 is $f_Z = f_X * f_Y$.

Figure 3.3-9 (a) The pdf's $f_X(y)$, $f_Y(y)$; (b) the reverse/shifted pdf $f_X(z - y)$.

(c) Region 3. $z \ge 1$. In this region the situation is as in Figure 3.3-10(c). From Equation 3.3-14 we obtain

$$f_Z(z) = \frac{1}{2} \int_{-1}^{1} e^{-(z - y)}\, dy = \frac{1}{2}\left[e^{-(z - 1)} - e^{-(z + 1)}\right].$$
Before collecting these results to form a graph we make one final important observation: Since no delta functions were involved in the computation, $f_Z(z)$ must be a continuous function of $z$. Hence, as a check on the solution, the $f_Z(z)$ values at the boundaries of the regions must match. For example, at the junction $z = 1$ between region 2 and region 3,

$$\frac{1}{2}\left[1 - e^{-(z + 1)}\right]_{z=1} \stackrel{?}{=} \frac{1}{2}\left[e^{-(z - 1)} - e^{-(z + 1)}\right]_{z=1}.$$

Obviously the right and left sides of this equation agree (both equal $\frac{1}{2}[1 - e^{-2}]$), so we have some confidence in our solution (Figure 3.3-11).
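Beyond the continuity check, the piecewise answer can be compared with a direct numerical evaluation of the convolution integral, Equation 3.3-14. The sketch below uses a midpoint rule over the support $(-1, 1)$ of $f_Y$; the function names and the step count are illustrative assumptions.

```python
import math

def f_z(z):
    """Piecewise pdf of Z = X + Y obtained in the three regions above."""
    if z < -1:
        return 0.0
    if z < 1:
        return 0.5 * (1.0 - math.exp(-(z + 1.0)))
    return 0.5 * (math.exp(-(z - 1.0)) - math.exp(-(z + 1.0)))

def f_z_numeric(z, steps=20_000):
    """Midpoint-rule evaluation of the convolution integral over -1 < y < 1."""
    h = 2.0 / steps
    total = 0.0
    for k in range(steps):
        y = -1.0 + (k + 0.5) * h
        if z - y >= 0:                       # u(z - y): f_X(z - y) is zero otherwise
            total += 0.5 * math.exp(-(z - y)) * h
    return total

max_error = max(abs(f_z(z) - f_z_numeric(z)) for z in (-2.0, 0.0, 1.0, 2.5))
```

The quantity `max_error` should be small compared with the pdf values themselves, which are of order 0.1 to 0.4 in regions 2 and 3.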


Equations 3.3-14 and 3.3-15 can easily be extended to computing the pdf of $Z = aX + bY$. To be specific, let $a > 0$, $b > 0$. Then the region $g(x, y) \triangleq ax + by \le z$ is to the left of the line $y = z/b - ax/b$ (Figure 3.3-12).

Figure 3.3-10 Relative positions of $f_X(z - y)$ and $f_Y(y)$ for (a) $z < -1$; (b) $-1 \le z < 1$; (c) $z \ge 1$.

Figure 3.3-11 The pdf $f_Z(z)$ from Example 3.3-4.

Hence

$$F_Z(z) = \iint_{g(x,y) \le z} f_{XY}(x, y)\, dx\, dy = \int_{-\infty}^{\infty} f_Y(y) \left( \int_{-\infty}^{z/a - by/a} f_X(x)\, dx \right) dy. \tag{3.3-16}$$
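Figure 3.3-12 The region of integration for computing the pdf of $Z = aX + bY$ shown for $a > 0$, $b > 0$.

As usual, to obtain $f_Z(z)$ we differentiate with respect to $z$; this furnishes

$$f_Z(z) = \frac{1}{a} \int_{-\infty}^{\infty} f_X\left(\frac{z}{a} - \frac{by}{a}\right) f_Y(y)\, dy, \tag{3.3-17}$$

where we assumed that $X$ and $Y$ are independent RVs. Equivalently, we can compute $f_Z(z)$ by writing

$$V \triangleq aX, \quad W \triangleq bY, \quad Z \triangleq V + W.$$

Then again, assuming $a > 0$, $b > 0$ and $X$, $Y$ independent, we obtain from Equation 3.3-14

$$f_Z(z) = \int_{-\infty}^{\infty} f_V(z - w) f_W(w)\, dw,$$

where, from Equation 3.2-2,

$$f_V(v) = \frac{1}{a} f_X\left(\frac{v}{a}\right)$$

and

$$f_W(w) = \frac{1}{b} f_Y\left(\frac{w}{b}\right).$$

Thus,

$$f_Z(z) = \frac{1}{ab} \int_{-\infty}^{\infty} f_X\left(\frac{z - w}{a}\right) f_Y\left(\frac{w}{b}\right) dw. \tag{3.3-18}$$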
Although Equation 3.3-18 doesn’t “look” like Equation 3.3-17, in fact it is identical to it. We need only make the change of variable $y \triangleq w/b$ in Equation 3.3-18 to obtain Equation 3.3-17.


Example 3.3-5 
(a game of Sic bo) In many jurisdictions in the United States, the taxes and fees from legal 
gambling parlors are used to finance public education, build roads, etc. Gambling parlors 
operate to make a profit and set the odds to their advantage. In the popular game of Sic 
bo, the player bets on the outcome of a simultaneous throw of three dice. Many bets are 
possible, each with a different payoff. Events that are more likely have a smaller payoff, 
while events that are less likely have a larger payoff. At one large gambling parlor the set 
odds are the following: 





1. Sum of three dice equals 4 or 17 (60 to 1);

2. Sum of three dice equals 5 or 16 (30 to 1);

3. Sum of three dice equals 6 or 15 (17 to 1);

4. Sum of three dice equals 7 or 14 (12 to 1);

5. Sum of three dice equals 8 or 13 (8 to 1);

6. Sum of three dice equals 9 or 10 or 11 or 12 (6 to 1).


For example, 60 to 1 odds means that if the player bets one dollar and the event 
occurs, he/she gets 60 dollars back minus the dollar ante. It is of interest to calculate the 
probabilities of the various events. 


Solution All the outcomes involve the sum of three i.i.d. random variables. Let $X$, $Y$, $Z$ denote the numbers that show up on the three dice, respectively. We can compute the result we need by two successive convolutions. Thus, for the sum on the faces of two dice, the PMF of $X + Y$ is $P_{X+Y}(l) \triangleq \sum_{i=1}^{6} P_X(l - i) P_Y(i)$ and the result is shown in Figure 3.3-13. To compute the PMF of the sum of all three RVs, $P_{X+Y+Z}(n)$, we perform a second convolution as $P_{X+Y+Z}(n) = \sum_{i=2}^{12} P_Z(n - i) P_{X+Y}(i)$. The result is shown in Figure 3.3-14. From the second convolution, we obtain the probabilities of the events of interest.

Figure 3.3-13 Probabilities of getting a sum on the faces of two dice.

Figure 3.3-14 Probabilities of getting a sum on the faces of three dice.

We define a “fair payout” (FP) as the return from the house that, on the average, yields no loss or gain to the bettor.† If $E$ is the event the bettor bets on, and the ante is \$1.00, then for an FP the average net return should be 0, so $0 = -\$1.00 + \text{FP} \times P[E]$. So $\text{FP} = 1/P[E]$.

†Obviously the house needs to make enough to cover its expenses, for example, salaries, utilities, etc. The definition of a “fair payout” here ignores these niceties. Also the notion of average will be explored in some detail in Chapter 4.

We read the results directly from Figure 3.3-14 to obtain the following:

1. Getting a sum of 4 or 17 (you can bet on either but not both) has a win probability of 3/216 or a fair payout of 72:1 (compare with 60:1).

2. Getting a sum of 5 or 16 (you can bet on either but not both) has a win probability of 6/216 or a fair payout of 36:1 (compare with 30:1).

3. Getting a sum of 6 or 15 (you can bet on either but not both) has a win probability of 10/216 or a fair payout of about 22:1 (compare with 17:1).

4. Getting a sum of 7 or 14 (you can bet on either but not both) has a win probability of 15/216 or a fair payout of about 14:1 (compare with 12:1).

5. Getting a sum of 8 or 13 (you can bet on either but not both) has a win probability of 21/216 or a fair payout of about 10:1 (compare with 8:1).

6. Getting a sum of 9 or 12 (you can bet on either but not both) has a win probability of 25/216 or a fair payout of about 9:1 (compare with 6:1).

7. Getting a sum of 10 or 11 (you can bet on either but not both) has a win probability of 27/216 or a fair payout of 8:1 (compare with 6:1).
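The two successive convolutions are short enough to carry out exactly. The sketch below represents each PMF as a dictionary from outcome to probability and uses exact fractions; the helper names `convolve` and `fair_payout` are our own.

```python
from fractions import Fraction

def convolve(p, q):
    """Discrete convolution of two PMFs given as {value: probability} dictionaries."""
    out = {}
    for a, pa in p.items():
        for b, pb in q.items():
            out[a + b] = out.get(a + b, 0) + pa * pb
    return out

def fair_payout(prob):
    """FP = 1 / P[E] for a one-dollar ante."""
    return 1 / prob

die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)            # PMF of X + Y
three_dice = convolve(two_dice, die)     # PMF of X + Y + Z

# Win probabilities in units of 1/216 for the sums 4 through 10.
counts = [three_dice[s] * 216 for s in range(4, 11)]
payouts = {s: float(fair_payout(three_dice[s])) for s in range(4, 11)}
```

The list `counts` reproduces the numerators 3, 6, 10, 15, 21, 25, 27 quoted above, and `payouts` gives the unrounded fair payouts, for example 21.6 for a sum of 6.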


Example 3.3-6 — Žž Ž > = > >o o 
(square-law detector) Let X and Y be independent RVs, both distributed as U(—1,1). 
Compute the pdf of V Ê (X +Y)?. 


Solution We solve this problem in two steps. First, we compute the pdf of Z 4x +Y; 
then we compute the pdf of V = Z?. Using the pulse-width one rect function (see def. on 
p. 170), we have 


fx(x) = rect (5) 
jut) = recs (2) 
fx-y= prect (z 7 2) . 





From Equation 3.3-14 we get 


fz(z) = if. rect (3) rect (=) dy. (3.3-19) 


oo 


The evaluation of Equation 3.3-19 is best done by keeping track graphically of where the 
“moving,” that is, z-dependent function rect((z—y)/2), is centered vis-a-vis the “stationary,” 
that is, z-independent function rect(y/2). The term moving is used because as z is varied, 
the function fx ((z—y)/2) has the appearance of moving past fy(y). The situation for four 
different values of z is shown in Figure 3.3-15. 

The evaluation of fz(z) for the four distinct regions is as follows: 


(a) z < −2. In this region there is no overlap so

$$f_Z(z) = 0.$$

(b) −2 ≤ z < 0. In this region there is overlap in the interval (−1, z + 1) so

$$f_Z(z) = \frac{1}{4}\int_{-1}^{z+1} dy = \frac{1}{4}(z + 2).$$


Figure 3.3-15 Four distinct regions in the convolution of two uniform densities: (a) z < −2;
(b) −2 ≤ z < 0; (c) 0 ≤ z < 2; (d) z ≥ 2.


(c) 0 ≤ z < 2. In this region there is overlap in the interval (z − 1, 1) so

$$f_Z(z) = \frac{1}{4}\int_{z-1}^{1} dy = \frac{1}{4}(2 - z).$$

(d) 2 ≤ z. In this region there is no overlap so

$$f_Z(z) = 0.$$

If we put all these results together, we obtain

$$f_Z(z) = \frac{1}{4}\,(2 - |z|)\,\mathrm{rect}\!\left(\frac{z}{4}\right),$$ (3.3-20)

which is graphed in Figure 3.3-16.


Figure 3.3-16 The pdf of Z = X + Y for X, Y i.i.d. RVs uniform on (−1, 1).


Figure 3.3-17 The pdf of V in Example 3.3-6.


To complete the solution to this problem, we still need the pdf of $V = Z^2$. We compute
$f_V(v)$ using Equation 3.2-23 with $g(z) = z^2$. For v > 0, the equation $v - z^2 = 0$ has two
real roots, that is, $z_1 = \sqrt{v}$, $z_2 = -\sqrt{v}$; for v < 0, there are no real roots. Hence, using
Equation 3.3-20 in

$$f_V(v) = \sum_{i=1}^{2} \frac{f_Z(z_i)}{2|z_i|}$$

yields

$$f_V(v) = \begin{cases} \dfrac{1}{4}\left(\dfrac{2}{\sqrt{v}} - 1\right), & 0 < v \le 4,\\[4pt] 0, & \text{otherwise,} \end{cases}$$ (3.3-21)

which is shown in Figure 3.3-17.
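As a sanity check on Equations 3.3-20 and 3.3-21, the following minimal sketch compares two routes to the CDF of V: integrating Equation 3.3-21 in closed form, which gives $F_V(v) = \sqrt{v} - v/4$ on (0, 4], and numerically integrating the triangular pdf of Z over $(-\sqrt{v}, \sqrt{v})$. The function names are ours, chosen for illustration only.

```python
import math

def f_z(z):
    """Triangular pdf of Z = X + Y (Equation 3.3-20)."""
    return 0.25 * (2.0 - abs(z)) if abs(z) < 2.0 else 0.0

def cdf_v(v):
    """Closed-form integral of Equation 3.3-21: F_V(v) = sqrt(v) - v/4 on (0, 4]."""
    if v <= 0.0:
        return 0.0
    if v >= 4.0:
        return 1.0
    return math.sqrt(v) - v / 4.0

def cdf_v_from_z(v, steps=20000):
    """P[Z**2 <= v] = P[-sqrt(v) <= Z <= sqrt(v)] by midpoint integration of f_z."""
    a = math.sqrt(v)
    h = 2.0 * a / steps
    return sum(f_z(-a + (i + 0.5) * h) for i in range(steps)) * h

for v in (0.25, 1.0, 2.25, 4.0):
    print(v, cdf_v(v), cdf_v_from_z(v))
```

The two columns should agree to within numerical integration error, and $F_V(4) = 1$ confirms that Equation 3.3-21 integrates to unity.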








The PMF of the sum of discrete random variables can be computed by discrete convolution.
For instance, let X and Y be two RVs that take on values $x_1, \ldots, x_k, \ldots$ and
$y_1, \ldots, y_j, \ldots$, respectively. Then $Z \triangleq X + Y$ is obviously discrete as well and the PMF is
given by

$$P_Z(z_n) = \sum_{x_k + y_j = z_n} P_{X,Y}(x_k, y_j).$$ (3.3-22)

If X and Y are independent, Equation 3.3-22 becomes

$$P_Z(z_n) = \sum_{x_k + y_j = z_n} P_X(x_k)P_Y(y_j) = \sum_{x_k} P_X(x_k)P_Y(z_n - x_k).$$ (3.3-23a)

If the $z_n$'s and $x_k$'s are equally spaced† then Equation 3.3-23a is recognized as a discrete
convolution, in which case it can be written as

$$P_Z(n) = \sum_{\text{all } k} P_X(k)P_Y(n - k).$$ (3.3-23b)

An illustration of the use of Equation 3.3-23b is given below.


Example 3.3-7
(sum of Bernoulli RVs) Let $B_1$ and $B_2$ be two independent Bernoulli RVs with common
PMF

$$P_B(k) = \begin{cases} p, & k = 1,\\ q, & k = 0,\\ 0, & \text{else,} \end{cases} \qquad \text{where } q = 1 - p.$$

Let $M \triangleq B_1 + B_2$ and find the PMF $P_M(m)$. We start with the general result

$$P_M(m) = \sum_{k=-\infty}^{+\infty} P_{B_1}(k)P_{B_2}(m - k) = \sum_{k=0}^{1} P_{B_1}(k)P_{B_2}(m - k).$$

Since each $B_i$ can only take on values 0 and 1, the allowable values for M are 0, 1, and 2.
For all other values of m, $P_M(m) = 0$. This can also be seen graphically from the discrete
convolution illustration in Figure 3.3-18.

Calculating the nonzero values of the PMF $P_M$, we obtain

$$P_M(0) = P_{B_1}(0)P_{B_2}(0) = q^2$$
$$P_M(1) = P_{B_1}(0)P_{B_2}(1) + P_{B_1}(1)P_{B_2}(0) = 2pq$$
$$P_M(2) = P_{B_1}(1)P_{B_2}(1) = p^2.$$

The student may notice that M is distributed as binomial b(k; 2, p). Why is this? What
would happen if we summed in another independent Bernoulli RV?
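One way to explore these questions numerically is to code Equation 3.3-23b directly. The sketch below assumes both PMFs live on the integers 0, 1, 2, ... and are stored as lists; the helper name `convolve_pmf` and the value p = 0.3 are illustrative choices, not part of the text.

```python
def convolve_pmf(px, py):
    """Discrete convolution of two PMFs on 0, 1, 2, ... (Equation 3.3-23b).

    px[k] = P[X = k] and py[j] = P[Y = j]; the result pz satisfies
    pz[n] = sum over k of px[k] * py[n - k] = P[X + Y = n].
    """
    pz = [0.0] * (len(px) + len(py) - 1)
    for k, pxk in enumerate(px):
        for j, pyj in enumerate(py):
            pz[k + j] += pxk * pyj
    return pz

p = 0.3                # illustrative value of the Bernoulli parameter
q = 1.0 - p
bernoulli = [q, p]     # P_B(0) = q, P_B(1) = p

two = convolve_pmf(bernoulli, bernoulli)   # PMF of B1 + B2
three = convolve_pmf(two, bernoulli)       # PMF after summing in a third Bernoulli RV

print(two)     # compare with [q**2, 2*p*q, p**2]
print(three)   # compare with [q**3, 3*p*q**2, 3*p**2*q, p**3]
```

Each further convolution with the Bernoulli PMF lengthens the list by one entry, and the entries can be compared with the binomial law of the corresponding order.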


†For example, let $z_n = n\Delta$, $x_k = k\Delta$, $\Delta$ a constant.


Figure 3.3-18 Illustration of discrete convolution of two Bernoulli PMFs.


Example 3.3-8
(sum of Poisson RVs) Let X and Y be two independent Poisson RVs with PMFs
$P_X(k) = \frac{1}{k!}e^{-a}a^k$ and $P_Y(i) = \frac{1}{i!}e^{-b}b^i$, where a and b are the Poisson parameters for X and Y,
respectively. Let $Z \triangleq X + Y$. Then the PMF of Z, $P_Z(n)$, is given by

$$P_Z(n) = \sum_{k=0}^{n} P_X(k)P_Y(n - k) = \sum_{k=0}^{n} \frac{1}{k!\,(n - k)!}\,e^{-(a+b)}a^k b^{n-k}.$$ (3.3-24)

Recall the binomial theorem:

$$\sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} = (a + b)^n.$$ (3.3-25)

Then

$$P_Z(n) = \frac{1}{n!}\,e^{-(a+b)} \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} = \frac{(a + b)^n}{n!}\,e^{-(a+b)}, \qquad n \ge 0,$$ (3.3-26)

which is the Poisson law with parameter a + b. Thus, we obtain the important result that the
sum of two independent Poisson RVs with parameters a, b is a Poisson RV with parameter
(a + b).
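The result can be spot-checked numerically by evaluating the finite sum in Equation 3.3-24 and comparing it with the Poisson PMF of parameter a + b. This is only a sketch; the parameter values a = 1.5 and b = 2.5 and the function names are arbitrary choices made for illustration.

```python
import math

def poisson_pmf(k, rate):
    """Poisson PMF: exp(-rate) * rate**k / k!."""
    return math.exp(-rate) * rate**k / math.factorial(k)

def sum_pmf(n, a, b):
    """P_Z(n) evaluated as the convolution sum of Equation 3.3-24."""
    return sum(poisson_pmf(k, a) * poisson_pmf(n - k, b) for k in range(n + 1))

a, b = 1.5, 2.5   # illustrative Poisson parameters
for n in range(6):
    print(n, sum_pmf(n, a, b), poisson_pmf(n, a + b))
```

The two printed columns should agree up to floating-point rounding for every n.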
Example 3.3-9
(sum of binomial random variables) A more challenging example than the previous one
involves computing the sum of i.i.d. binomial RVs X and Y. Let Z = X + Y; then the
PMF $P_Z(m)$ is given by

$$P_Z(m) = \sum_{k=-\infty}^{\infty} P_X(k)P_Y(m - k),$$

where

$$P_X(k) = P_Y(k) = \begin{cases} 0, & k < 0,\\ \binom{n}{k} p^k q^{n-k}, & 0 \le k \le n,\\ 0, & k > n. \end{cases}$$

Thus,

$$P_Z(m) = \sum_{k=\max(0,\,m-n)}^{\min(n,\,m)} \binom{n}{k} p^k q^{n-k} \binom{n}{m-k} p^{m-k} q^{n-(m-k)}
= p^m q^{2n-m} \sum_{k=\max(0,\,m-n)}^{\min(n,\,m)} \binom{n}{k}\binom{n}{m-k}.$$

The limits on this summation come from the need to enforce both 0 ≤ k ≤ n and 0 ≤ m − k ≤ n,
the latter being equivalent to m − n ≤ k ≤ m. Hence the range of the summation must
be max(0, m − n) ≤ k ≤ min(n, m) as indicated.

Somewhat amazingly

$$\sum_{k=\max(0,\,m-n)}^{\min(n,\,m)} \binom{n}{k}\binom{n}{m-k} = \binom{2n}{m},$$ (3.3-27)

so that we obtain the PMF of Z as

$$P_Z(m) = \binom{2n}{m} p^m q^{2n-m} \triangleq b(m; 2n, p).$$ (3.3-28)

Thus, the sum of two i.i.d. binomial RVs, each with PMF b(k; n, p), is a binomial RV with PMF
given as b(k; 2n, p).

To show that Equation 3.3-27 is true we first notice that the left-hand side (LHS) has
the same value whether m > n (in which case the sum goes from k = m − n up to k = n)
or whether m ≤ n (in which case the sum goes from k = 0 up to k = m). A simple way to
see this is to expand out the LHS in both cases. Indeed an expansion of the LHS for m ≤ n
yields

$$\binom{n}{0}\binom{n}{m} + \binom{n}{1}\binom{n}{m-1} + \cdots + \binom{n}{m}\binom{n}{0}.$$ (3.3-29)

Doing the expansion in the case m > n yields the same sum.
Proceeding with the verification of Equation 3.3-27, note that the number of subpopulations
of size m that can be formed from a population of size 2n is $C_m^{2n}$. But another way
to form these subpopulations is to break the population of size 2n into two populations
of size n each. Call these two populations of size n each, A and B, respectively. Then the
product $C_k^n C_{m-k}^n$ is the number of ways of choosing k elements from A and m − k
from B. Then clearly

$$\sum_{k=0}^{m} C_k^n C_{m-k}^n = C_m^{2n}$$ (3.3-30)

and since, as we said earlier,

$$\sum_{k=m-n}^{n} C_k^n C_{m-k}^n = \sum_{k=0}^{m} C_k^n C_{m-k}^n,^{\dagger}$$

the result in Equation 3.3-27 is equally valid when k goes from m − n to n.

In Chapter 4 we will find a simpler method for showing that the sum of i.i.d. binomial 
RVs is binomial. The method uses transformations called moment generating functions 
and/or characteristic functions. 
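Before turning to those transform methods, the identity in Equation 3.3-27 and the resulting PMF in Equation 3.3-28 can be confirmed by brute force for small n. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later); the function names and the value p = 0.3 are ours.

```python
from math import comb

def vandermonde_lhs(n, m):
    """Left-hand side of Equation 3.3-27, summed over max(0, m-n) <= k <= min(n, m)."""
    lo, hi = max(0, m - n), min(n, m)
    return sum(comb(n, k) * comb(n, m - k) for k in range(lo, hi + 1))

def binomial_pmf(k, n, p):
    """b(k; n, p) for 0 <= k <= n."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def sum_of_two_binomials_pmf(m, n, p):
    """P_Z(m) computed as the discrete convolution of two b(k; n, p) PMFs."""
    lo, hi = max(0, m - n), min(n, m)
    return sum(binomial_pmf(k, n, p) * binomial_pmf(m - k, n, p)
               for k in range(lo, hi + 1))

n, p = 4, 0.3   # illustrative values
for m in range(2 * n + 1):
    print(m, vandermonde_lhs(n, m), comb(2 * n, m),
          sum_of_two_binomials_pmf(m, n, p), binomial_pmf(m, 2 * n, p))
```

The integer columns should match exactly, and the two PMF columns should agree up to rounding, for both m ≤ n and m > n.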


We mentioned earlier in Section 3.2 that although the formula in Equation 3.3-23a 
(and its extensions to be discussed in Section 3.4) is very handy for solving problems 
of this type, the indirect approach is sometimes easier. We illustrate with the following 
example. 


Example 3.3-10
(sum of squares) Let X and Y be i.i.d. RVs with X:N(0, σ²). What is the pdf of
$Z \triangleq X^2 + Y^2$?


Solution We begin with the fundamental result given in Equation 3.3-2:

$$F_Z(z) = \iint_{(x,y) \in C_z} f_{XY}(x, y)\,dx\,dy \qquad \text{for } z \ge 0$$

$$\phantom{F_Z(z)} = \frac{1}{2\pi\sigma^2} \iint_{(x,y) \in C_z} e^{-(1/2\sigma^2)(x^2 + y^2)}\,dx\,dy.$$ (3.3-31)

The region $C_z$ consists of the shaded region in Figure 3.3-19.
Equation 3.3-31 is easily evaluated using polar coordinates. Let

$$x = r\cos\theta, \qquad y = r\sin\theta, \qquad dx\,dy \to r\,dr\,d\theta.$$


†This formula can also be verified by using the change of variables $l \triangleq m - k$ in the RHS. The resulting
sum will run from large to small, but reversing the summation order does not affect a sum.

Figure 3.3-19 The region $C_z$ for the event {X² + Y² ≤ z} for z ≥ 0.


Then $x^2 + y^2 \le z \to r \le \sqrt{z}$ and Equation 3.3-31 is transformed into

$$F_Z(z) = \frac{1}{2\pi\sigma^2} \int_0^{2\pi} d\theta \int_0^{\sqrt{z}} r \exp\!\left(-\frac{1}{2\sigma^2}r^2\right) dr
= \left[1 - e^{-z/2\sigma^2}\right]u(z)$$ (3.3-32)

and

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \frac{1}{2\sigma^2}\,e^{-z/2\sigma^2}u(z).$$ (3.3-33)

Thus, $Z = X^2 + Y^2$ is an exponential RV if X and Y are i.i.d. zero-mean Gaussian.


Example 3.3-11
(square root of sum of squares) If the previous example is modified to finding the pdf of
$Z \triangleq (X^2 + Y^2)^{1/2}$, a radically different pdf results. Again we use Equation 3.3-2 except
that now $C_z$ consists of the shaded region in Figure 3.3-20.

Thus,

$$F_Z(z) = \frac{1}{2\pi\sigma^2} \int_0^{2\pi} d\theta \int_0^{z} r \exp\!\left(-\frac{1}{2\sigma^2}r^2\right) dr
= \left(1 - e^{-z^2/2\sigma^2}\right)u(z)$$ (3.3-34)

and

$$f_Z(z) = \frac{z}{\sigma^2}\,e^{-z^2/2\sigma^2}u(z),$$ (3.3-35)

which is the Rayleigh density function. With σ = 1 it is also known as the chi distribution with
two degrees of freedom; the chi-square distribution with two degrees of freedom is the
exponential law found for $X^2 + Y^2$ in Example 3.3-10. The exponential and Rayleigh pdf’s are
compared in Figure 3.3-21.


Figure 3.3-20 The region $C_z$ for the event {(X² + Y²)^{1/2} ≤ z}.

Figure 3.3-21 Rayleigh and exponential pdf's.
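The two CDFs just derived can be compared against simulation. The following is a rough Monte Carlo sketch, not a proof: it draws i.i.d. N(0, σ²) pairs with the standard library generator (fixed seed, so runs are repeatable) and estimates both P[X² + Y² ≤ z] and P[(X² + Y²)^{1/2} ≤ z]. The values σ = 1.5 and z = 2, the trial count, and all function names are arbitrary illustrative choices.

```python
import math
import random

def empirical_cdfs(z, sigma, trials=100_000, seed=0):
    """Estimate P[X^2 + Y^2 <= z] and P[sqrt(X^2 + Y^2) <= z] by simulation."""
    rng = random.Random(seed)          # fixed seed for repeatability
    hits_sum_sq = 0
    hits_envelope = 0
    for _ in range(trials):
        x = rng.gauss(0.0, sigma)
        y = rng.gauss(0.0, sigma)
        s = x * x + y * y
        hits_sum_sq += s <= z
        hits_envelope += math.sqrt(s) <= z
    return hits_sum_sq / trials, hits_envelope / trials

def exponential_cdf(z, sigma):
    """Equation 3.3-32: CDF of X^2 + Y^2."""
    return 1.0 - math.exp(-z / (2.0 * sigma**2)) if z > 0 else 0.0

def rayleigh_cdf(z, sigma):
    """Equation 3.3-34: CDF of (X^2 + Y^2)^(1/2)."""
    return 1.0 - math.exp(-z * z / (2.0 * sigma**2)) if z > 0 else 0.0

sigma, z = 1.5, 2.0   # illustrative values
est_sum_sq, est_envelope = empirical_cdfs(z, sigma)
print(est_sum_sq, exponential_cdf(z, sigma))
print(est_envelope, rayleigh_cdf(z, sigma))
```

With this many trials the estimates are expected to fall within a few thousandths of the closed-form values.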





Stephen O. Rice [3-3], who in the 1940s did pioneering work in the analysis of electrical
noise, showed that narrow-band noise signals at center frequency ω can be represented by
the wave

$$Z(t) = X\cos\omega t + Y\sin\omega t,$$ (3.3-36)

where t is time, ω is the radian frequency in radians per second and where X and Y are
i.i.d. RVs distributed as N(0, σ²). The so-called envelope $Z \triangleq (X^2 + Y^2)^{1/2}$ has, therefore,
a Rayleigh distribution with parameter σ.

The next example generalizes the results of Example 3.3-11 and is a result of considerable
interest in communication theory.





*Example 3.3-12†
(the Rician density)‡ S. O. Rice considered a version of the following problem: Let
X:N(P, σ²) and Y:N(0, σ²) be independent Gaussian RVs. What is the pdf of
$Z = (X^2 + Y^2)^{1/2}$? Note that with power parameter P = 0, we obtain the solution of Example
3.3-11.

We write

$$F_Z(z) = \begin{cases} \dfrac{1}{2\pi\sigma^2} \displaystyle\iint_{(x^2+y^2)^{1/2} \le z} \exp\!\left\{-\dfrac{1}{2}\left[\left(\dfrac{x - P}{\sigma}\right)^2 + \left(\dfrac{y}{\sigma}\right)^2\right]\right\} dx\,dy, & z > 0,\\[6pt] 0, & z < 0. \end{cases}$$ (3.3-37)

†Starred examples are somewhat more involved and can be omitted on a first reading.
‡Sometimes called the Rice-Nakagami pdf in recognition of the work of Nakagami around the time of
World War II.

The usual Cartesian-to-polar transformation $x = r\cos\theta$, $y = r\sin\theta$, $r = (x^2 + y^2)^{1/2}$,
$\theta = \tan^{-1}(y/x)$ yields

$$F_Z(z) = \frac{\exp\!\left[-\frac{1}{2}\left(\frac{P}{\sigma}\right)^2\right]}{2\pi\sigma^2} \int_0^{z} e^{-\frac{1}{2}(r/\sigma)^2} \left(\int_0^{2\pi} e^{rP\cos\theta/\sigma^2}\,d\theta\right) r\,dr \cdot u(z).$$ (3.3-38)

The function

$$I_0(x) \triangleq \frac{1}{2\pi} \int_0^{2\pi} e^{x\cos\theta}\,d\theta$$

is called the zero-order modified Bessel function of the first kind and is monotonically
increasing like $e^x$. With this notation, the cumbersome Equation 3.3-38 can be rewritten as

$$F_Z(z) = \frac{\exp\!\left[-\frac{1}{2}\left(\frac{P}{\sigma}\right)^2\right]}{\sigma^2} \int_0^{z} r\,I_0\!\left(\frac{rP}{\sigma^2}\right) e^{-\frac{1}{2}(r/\sigma)^2}\,dr \cdot u(z),$$ (3.3-39)

where the step function u(z) ensures that the above is valid for all z. To obtain $f_Z(z)$ we
differentiate with respect to z. This produces

$$f_Z(z) = \frac{z}{\sigma^2} \exp\!\left[-\frac{1}{2}\left(\frac{z^2 + P^2}{\sigma^2}\right)\right] I_0\!\left(\frac{zP}{\sigma^2}\right) \cdot u(z).$$ (3.3-40)

The pdf given in Equation 3.3-40 is called the Rician probability density. Since $I_0(0) = 1$,
we obtain the Rayleigh law when P = 0. When $zP \gg \sigma^2$, that is, the argument of $I_0(\cdot)$ is
large, we use the approximation

$$I_0(x) \approx \frac{e^x}{\sqrt{2\pi x}}$$

to obtain

$$f_Z(z) \approx \frac{1}{\sqrt{2\pi\sigma^2}} \left(\frac{z}{P}\right)^{1/2} e^{-\frac{1}{2}[(z - P)/\sigma]^2},$$

which is almost Gaussian [except for the factor $(z/P)^{1/2}$]. This is the pdf of the envelope
of the sum of a strong sine wave and weak narrow-band Gaussian noise, a situation that
occurs not infrequently in electrical communications.
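The properties claimed for Equation 3.3-40 can be examined numerically. The sketch below evaluates $I_0$ straight from its integral definition with a midpoint rule (an implementation choice of ours, adequate for moderate arguments; it is not how a numerical library would normally compute Bessel functions), then checks the Rayleigh limit at P = 0, the normalization of the pdf, and the large-argument approximation. All names and parameter values are illustrative.

```python
import math

def bessel_i0(x, steps=400):
    """I_0(x) = (1/2pi) * integral over [0, 2pi] of exp(x cos(theta)), midpoint rule."""
    h = 2.0 * math.pi / steps
    total = sum(math.exp(x * math.cos((i + 0.5) * h)) for i in range(steps))
    return total * h / (2.0 * math.pi)

def rician_pdf(z, P, sigma):
    """Equation 3.3-40."""
    if z <= 0.0:
        return 0.0
    s2 = sigma * sigma
    return (z / s2) * math.exp(-0.5 * (z * z + P * P) / s2) * bessel_i0(z * P / s2)

def rayleigh_pdf(z, sigma):
    """Equation 3.3-35."""
    if z <= 0.0:
        return 0.0
    s2 = sigma * sigma
    return (z / s2) * math.exp(-0.5 * z * z / s2)

def large_argument_approx(z, P, sigma):
    """Near-Gaussian approximation, intended for z*P much larger than sigma**2."""
    return (math.sqrt(z / P) / math.sqrt(2.0 * math.pi * sigma * sigma)
            * math.exp(-0.5 * ((z - P) / sigma) ** 2))

# P = 0 recovers the Rayleigh law.
print(rician_pdf(1.0, 0.0, 2.0), rayleigh_pdf(1.0, 2.0))

# The pdf should integrate to (nearly) one; midpoint rule on (0, 12) with P = 2, sigma = 1.
h = 12.0 / 400
print(sum(rician_pdf((i + 0.5) * h, 2.0, 1.0) for i in range(400)) * h)

# Strong-signal case: exact pdf versus the approximation at z = P = 5, sigma = 1.
print(rician_pdf(5.0, 5.0, 1.0), large_argument_approx(5.0, 5.0, 1.0))
```

In the last comparison the argument of $I_0$ is 25, and the approximation is expected to be within about one percent of the exact value.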





3.4 SOLVING PROBLEMS OF THE TYPE V = g( X,Y), W = h(X,Y) 


The problem of two functions of two random variables is essentially an extension of the 
earlier cases except that the algebra is somewhat more involved. 


Fundamental Problem 


We are given two RVs X, Y with joint pdf $f_{XY}(x, y)$ and two differentiable functions
g(x, y) and h(x, y). Two new random variables are constructed according to V = g(X, Y),
W = h(X, Y). How do we compute the joint CDF $F_{VW}(v, w)$ (or joint pdf $f_{VW}(v, w)$) of
V and W?

Figure 3.4-1 A two-variable-to-two-variable matrixer.


Illustrations. 1. The transformation shown in Figure 3.4-1 occurs in communication 
systems such as in the generation of stereo baseband systems [3-2]. The {a;} are gains. 
When a; = az = cos@ and az = a4 = sin8, the circuit is known as a 6-rotational trans- 
former. In another application if X and Y are used to represent, for example, the left and 
right pick-up signals in stereo broadcasting, then V and W represent the difference and 
sum signals if all the a,;’s are set to unity. The sum and difference signals are then used 
to generate the signal to be transmitted. Suppose for the moment that there are no source 
signals and that X and Y therefore represent only Gaussian noise. What is the pdf of V 
and W? 

2. The error in the landing location of a spacecraft from a prescribed point is denoted by
(X, Y) in Cartesian coordinates. We wish to specify the error in polar coordinates
$V \triangleq (X^2 + Y^2)^{1/2}$, $W \triangleq \tan^{-1}(Y/X)$. Given the joint pdf $f_{XY}(x, y)$ of landing error coordinates in
Cartesian coordinates, how do we compute the pdf of the landing error in polar coordinates?

The solution to the problem at hand is, as before, to find a point set $C_{vw}$ such that
the two events {V ≤ v, W ≤ w} and {(X, Y) ∈ $C_{vw}$} are equal.† Thus, the fundamental
relation is

$$P[V \le v, W \le w] \triangleq F_{VW}(v, w) = \iint_{(x,y) \in C_{vw}} f_{XY}(x, y)\,dx\,dy.$$ (3.4-1)

The region $C_{vw}$ is given by the points x, y that satisfy

$$C_{vw} = \{(x, y): g(x, y) \le v,\ h(x, y) \le w\}.$$ (3.4-2)

We illustrate the application of Equation 3.4-1 with an example.

Example 3.4-1
(sum and difference) We are given $V \triangleq X + Y$ and $W \triangleq X - Y$ and wish to calculate the pdf
$f_{VW}(v, w)$. The point set $C_{vw}$ is described by the combined constraints $g(x, y) \triangleq x + y \le v$
and $h(x, y) \triangleq x - y \le w$; it is shown in Figure 3.4-2 for v > 0, w > 0.


†In more elaborate notation, we would write {ζ: V(ζ) ≤ v and W(ζ) ≤ w} = {ζ: (X(ζ), Y(ζ)) ∈ $C_{vw}$}.
Figure 3.4-2 Point set $C_{vw}$ (shaded region) for Example 3.4-1.


The integration over the shaded region yields

$$F_{VW}(v, w) = \int_{-\infty}^{(v+w)/2} \left(\int_{x-w}^{v-x} f_{XY}(x, y)\,dy\right) dx.$$ (3.4-3)

To obtain the joint density $f_{VW}(v, w)$, we use Equation 2.6-30. Hence

$$f_{VW}(v, w) = \frac{\partial^2 F_{VW}(v, w)}{\partial v\,\partial w}
= \frac{\partial^2}{\partial v\,\partial w} \int_{-\infty}^{(v+w)/2} \left(\int_{x-w}^{v-x} f_{XY}(x, y)\,dy\right) dx$$

$$= \frac{\partial}{\partial v}\left[\frac{\partial}{\partial w} \int_{-\infty}^{(v+w)/2} \left(\int_{x-w}^{v-x} f_{XY}(x, y)\,dy\right) dx\right]$$

$$= \frac{\partial}{\partial v}\left[\frac{1}{2}\int_{(v-w)/2}^{(v-w)/2} f_{XY}\!\left(\frac{v+w}{2}, y\right) dy + \int_{-\infty}^{(v+w)/2} \left(\frac{\partial}{\partial w}\int_{x-w}^{v-x} f_{XY}(x, y)\,dy\right) dx\right]$$

$$= \frac{\partial}{\partial v} \int_{-\infty}^{(v+w)/2} f_{XY}(x, x - w)\,dx$$

because the first integral is zero for continuous RVs X and Y,

$$= \frac{1}{2}\,f_{XY}\!\left(\frac{v+w}{2}, \frac{v-w}{2}\right),$$ (3.4-4)

where use has been made of the general formula for differentiation of an integral (see
Appendix A.2.). Thus, even this simple problem, involving linear functions and just two
RVs, requires a considerable amount of work and care to obtain the joint pdf solution. For
this reason, problems of the type discussed in this section and their extensions to n RVs,
that is, $Y_1 = g_1(X_1, \ldots, X_n)$, $Y_2 = g_2(X_1, \ldots, X_n)$, ..., $Y_n = g_n(X_1, \ldots, X_n)$, are generally
solved by the technique discussed next called the direct method for joint pdf evaluation.
Essentially it is the two-dimensional extension of Equation 3.2-23.


Obtaining fyw Directly from f xy 


Instead of attempting to find $f_{VW}(v, w)$ through Equation 3.4-1, we can instead take a
different approach. Consider the elementary event

$$\{v < V \le v + dv,\ w < W \le w + dw\}$$

and the one-to-one† differentiable functions v = g(x, y), w = h(x, y). The inverse mappings
exist and are given by x = φ(v, w), y = ψ(v, w). Later we shall consider the more general
case where, possibly, more than one pair of $(x_i, y_i)$ produce a given (v, w).

The probability P[v < V ≤ v + dv, w < W ≤ w + dw] is the probability that V
and W lie in an infinitesimal rectangle of area dv dw with vertices at (v, w), (v + dv, w),
(v, w + dw), and (v + dv, w + dw). The image of this rectangle in the x, y coordinate system
is‡ an infinitesimal parallelogram with vertices at

$$P_1 = (x, y),$$
$$P_2 = \left(x + \frac{\partial\phi}{\partial v}dv,\ y + \frac{\partial\psi}{\partial v}dv\right),$$
$$P_3 = \left(x + \frac{\partial\phi}{\partial w}dw,\ y + \frac{\partial\psi}{\partial w}dw\right),$$
$$P_4 = \left(x + \frac{\partial\phi}{\partial v}dv + \frac{\partial\phi}{\partial w}dw,\ y + \frac{\partial\psi}{\partial v}dv + \frac{\partial\psi}{\partial w}dw\right).$$

This mapping is shown in Figure 3.4-3.

With $\mathcal{R}$ denoting the rectangular region shown in Figure 3.4-3(a) and $\mathcal{S}$ denoting
the parallelogram in Figure 3.4-3(b) and $A(\mathcal{R})$ and $A(\mathcal{S})$ denoting the areas of $\mathcal{R}$ and
$\mathcal{S}$ respectively, we obtain

$$P[v < V \le v + dv,\ w < W \le w + dw] = \iint_{\mathcal{R}} f_{VW}(\xi, \eta)\,d\xi\,d\eta$$ (3.4-5)
$$= f_{VW}(v, w)\,A(\mathcal{R})$$ (3.4-6)
$$= \iint_{\mathcal{S}} f_{XY}(\xi, \eta)\,d\xi\,d\eta$$ (3.4-7)
$$= f_{XY}(x, y)\,A(\mathcal{S}).$$ (3.4-8)


†Every point (x, y) maps into a unique (v, w) and vice versa.
‡See for example [3-4, p. 769].
Figure 3.4-3 An infinitesimal rectangle in the v, w system (a) maps into an infinitesimal parallelogram
(b) in the x, y system.


Equation 3.4-5 follows from the fundamental relation given in Equation 3.4-1; Equation
3.4-6 follows from the interpretation of the pdf given in Equation 2.4-6; Equation 3.4-7
follows by definition of the point set $\mathcal{S}$, that is, $\mathcal{S}$ is the set of points that makes the
events {(V, W) ∈ $\mathcal{R}$} and {(X, Y) ∈ $\mathcal{S}$} equal; and Equation 3.4-8 again follows from the
interpretation of pdf.

From Equations 3.4-6 and 3.4-8, we find that

$$f_{VW}(v, w) = \frac{A(\mathcal{S})}{A(\mathcal{R})}\,f_{XY}(x, y),$$ (3.4-9)

where x = φ(v, w) and y = ψ(v, w).

Essentially then, all that remains is to compute the ratio of the two areas. This is done in
Appendix C. There we show that the ratio $A(\mathcal{S})/A(\mathcal{R})$ is the magnitude of a quantity called
the Jacobian of the transformation x = φ(v, w), y = ψ(v, w) and given the symbol $\tilde{J}$. If there
is more than one solution to the equations v = g(x, y), w = h(x, y), say, $x_1 = \phi_1(v, w)$,
$y_1 = \psi_1(v, w)$, $x_2 = \phi_2(v, w)$, $y_2 = \psi_2(v, w)$, ..., $x_n = \phi_n(v, w)$, $y_n = \psi_n(v, w)$, then $\mathcal{R}$
maps into multiple, disjoint infinitesimal regions $\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_n$ and $A(\mathcal{S}_i)/A(\mathcal{R}) = |\tilde{J}_i|$,
i = 1, ..., n. The $|\tilde{J}_i|$ are often written as the magnitude of determinants, that is,

$$|\tilde{J}_i| = \mathrm{mag}\begin{vmatrix} \partial\phi_i/\partial v & \partial\phi_i/\partial w\\ \partial\psi_i/\partial v & \partial\psi_i/\partial w \end{vmatrix}
= \left|\frac{\partial\phi_i}{\partial v}\frac{\partial\psi_i}{\partial w} - \frac{\partial\psi_i}{\partial v}\frac{\partial\phi_i}{\partial w}\right|.$$ (3.4-10)

The end result is the important formula

$$f_{VW}(v, w) = \sum_{i=1}^{n} f_{XY}(x_i, y_i)\,|\tilde{J}_i|.$$ (3.4-11)

It is shown in Appendix C that $|\tilde{J}_i|^{-1} = |J_i| \triangleq |\partial g/\partial x_i \times \partial h/\partial y_i - \partial g/\partial y_i \times \partial h/\partial x_i|$. Then
we get the equally important formula

$$f_{VW}(v, w) = \sum_{i=1}^{n} f_{XY}(x_i, y_i)/|J_i|.$$ (3.4-12)
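Example 3.4-2
(linear functions) We are given two functions

$$v \triangleq g(x, y) = 3x + 5y$$
$$w \triangleq h(x, y) = x + 2y$$ (3.4-13)

and the joint pdf $f_{XY}$ of two RVs X, Y. What is the joint pdf of two new random variables
V = g(X, Y), W = h(X, Y)?


Solution The inverse mappings are computed from Equation 3.4-13 to be

$$x = \phi(v, w) = 2v - 5w$$
$$y = \psi(v, w) = -v + 3w.$$

Then

$$\frac{\partial\phi}{\partial v} = 2, \qquad \frac{\partial\phi}{\partial w} = -5, \qquad \frac{\partial\psi}{\partial v} = -1, \qquad \frac{\partial\psi}{\partial w} = 3$$

and

$$|\tilde{J}| = \mathrm{mag}\begin{vmatrix} 2 & -5\\ -1 & 3 \end{vmatrix} = 1.$$

Assume $f_{XY}(x, y) = (2\pi)^{-1}\exp[-\frac{1}{2}(x^2 + y^2)]$. Then, from Equation 3.4-11

$$f_{VW}(v, w) = \frac{1}{2\pi}\exp\!\left\{-\frac{1}{2}\left[(2v - 5w)^2 + (-v + 3w)^2\right]\right\}
= \frac{1}{2\pi}\exp\!\left[-\frac{1}{2}\left(5v^2 - 26vw + 34w^2\right)\right].$$

Thus, the transformation converts uncorrelated Gaussian RVs into correlated Gaussian RVs.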
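The algebra of this example is easy to confirm numerically. The minimal sketch below evaluates Equation 3.4-11 directly, using the inverse mappings and $|\tilde{J}| = 1$, and compares the result with the closed-form quadratic exponent at a few arbitrarily chosen points; it also checks that the inverse mappings undo Equation 3.4-13. The function names are ours.

```python
import math

def f_xy(x, y):
    """Joint pdf of two independent N(0, 1) RVs, as assumed in the example."""
    return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)

def inverse_map(v, w):
    """Inverse of v = 3x + 5y, w = x + 2y."""
    return 2.0 * v - 5.0 * w, -v + 3.0 * w

def f_vw_direct(v, w):
    """Equation 3.4-11 with a single root and |J~| = |2*3 - (-1)*(-5)| = 1."""
    x, y = inverse_map(v, w)
    jac_tilde = abs(2.0 * 3.0 - (-1.0) * (-5.0))
    return f_xy(x, y) * jac_tilde

def f_vw_closed_form(v, w):
    """Closed form with exponent -(5v^2 - 26vw + 34w^2)/2."""
    return math.exp(-0.5 * (5.0 * v * v - 26.0 * v * w + 34.0 * w * w)) / (2.0 * math.pi)

for v, w in [(0.0, 0.0), (1.0, 0.4), (-0.7, -0.2), (2.0, 0.8)]:
    x, y = inverse_map(v, w)
    # Forward map should reproduce (v, w): 3x + 5y and x + 2y.
    print((v, w), (3.0 * x + 5.0 * y, x + 2.0 * y), f_vw_direct(v, w), f_vw_closed_form(v, w))
```

The nonzero cross term −26vw in the exponent is what signals the correlation between V and W.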
Example 3.4-3
(two ordered random variables) Consider two i.i.d., continuous random variables with pdf’s
$f_{X_1}(x) = f_{X_2}(x) = f_X(x)$. We define two new random variables as $Y_1 = \min(X_1, X_2)$ and
$Y_2 = \max(X_1, X_2)$. Clearly $Y_1 < Y_2$,† meaning that realizations of $Y_1$ are always less than
realizations of $Y_2$. We seek the joint pdf, $f_{Y_1Y_2}(y_1, y_2)$, of $Y_1$, $Y_2$ given that

$$Y_1 = g(X_1, X_2) = \min(X_1, X_2)$$
$$Y_2 = h(X_1, X_2) = \max(X_1, X_2).$$

Solution From Figure 3.4-4 (only the first quadrant is shown for convenience but all
four quadrants must be considered in any calculation), we see that there are two disjoint
regions and hence two solutions. We note that in $\mathcal{R}_1$, $x_1 > x_2$ while in $\mathcal{R}_2$,
$x_1 < x_2$. Thus, in $\mathcal{R}_1$ we have $y_1 = x_2$, $y_2 = x_1$ or, in the g, h notation, $y_1 = g_1(x_1, x_2) = x_2$,
$y_2 = h_1(x_1, x_2) = x_1$. The Jacobian magnitude of this transformation is unity so that
$f_{Y_1Y_2}(y_1, y_2) = f_{X_1X_2}(y_2, y_1) = f_X(y_2)f_X(y_1)$, $y_1 < y_2$.


†We ignore the zero-probability event $Y_1 = Y_2$.

Figure 3.4-4 Showing the two regions of interest for Example 3.4-3.


Repeating this analysis for $\mathcal{R}_2$, we have $y_1 = g_2(x_1, x_2) = x_1$, $y_2 = h_2(x_1, x_2) = x_2$ and
once again the Jacobian magnitude of this transformation is unity. Hence $f_{Y_1Y_2}(y_1, y_2) =
f_{X_1X_2}(y_1, y_2) = f_X(y_1)f_X(y_2)$, $y_1 < y_2$. As always we sum the solutions over the different
roots/regions (here there are two) and obtain

$$f_{Y_1Y_2}(y_1, y_2) = \begin{cases} 2f_X(y_1)f_X(y_2), & -\infty < y_1 < y_2 < \infty,\\ 0, & \text{else.} \end{cases}$$

Question for the reader: We know that $X_1$ and $X_2$ are independent; are $Y_1$ and $Y_2$ independent?


Example 3.4-4
(marginal probabilities of ordered random variables) In the previous example we ordered
two i.i.d. RVs $X_1$, $X_2$ as $Y_1$, $Y_2$, where $Y_1 < Y_2$. The joint pdf of $Y_1$, $Y_2$ was shown to be
$f_{Y_1Y_2}(y_1, y_2) = 2f_X(y_1)f_X(y_2)$, $-\infty < y_1 < y_2 < \infty$. Here we wish to obtain the marginal
pdf’s of $Y_1$, $Y_2$.


Solution To get $f_{Y_1}(y_1)$ we have to integrate out $f_{Y_1Y_2}(y_1, y_2) = 2f_X(y_1)f_X(y_2)$ over all $y_2 > y_1$.
Hence

$$f_{Y_1}(y_1) = 2f_X(y_1)\int_{y_1}^{\infty} f_X(y_2)\,dy_2 = 2f_X(y_1)\,(1 - F_X(y_1)), \qquad -\infty < y_1 < \infty.$$

Likewise, to get $f_{Y_2}(y_2)$ we integrate out $f_{Y_1Y_2}(y_1, y_2) = 2f_X(y_1)f_X(y_2)$ over all $y_1 < y_2$.
The result is

$$f_{Y_2}(y_2) = 2f_X(y_2)\int_{-\infty}^{y_2} f_X(y_1)\,dy_1 = 2f_X(y_2)F_X(y_2), \qquad -\infty < y_2 < \infty.$$
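These marginals are simple to evaluate for a specific parent density. The sketch below takes $f_X$ to be the standard Normal pdf (our choice for illustration) and uses a plain midpoint rule to confirm that both marginals integrate to one and to compute their means and variances. For the standard Normal parent the means are known to be ∓1/√π and both variances equal 1 − 1/π, which is less than the unit variance of the parent.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pdf_min(y):
    """Marginal pdf of Y1 = min(X1, X2): 2 f_X(y) (1 - F_X(y))."""
    return 2.0 * normal_pdf(y) * (1.0 - normal_cdf(y))

def pdf_max(y):
    """Marginal pdf of Y2 = max(X1, X2): 2 f_X(y) F_X(y)."""
    return 2.0 * normal_pdf(y) * normal_cdf(y)

def integrate(f, a=-8.0, b=8.0, steps=4000):
    """Midpoint rule; the range (-8, 8) captures essentially all the probability mass."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

for name, pdf in (("min", pdf_min), ("max", pdf_max)):
    area = integrate(pdf)
    mean = integrate(lambda y: y * pdf(y))
    var = integrate(lambda y: y * y * pdf(y)) - mean**2
    print(name, area, mean, var)
```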
Example 3.4-5
(the minimum and maximum of two Normal random variables) We wish to see what the pdfs
of ordered Normal RVs look like. To that end let $X_1$, $X_2$ be i.i.d. Normal N(0, 1) RVs
and define $Y_1 \triangleq \min(X_1, X_2)$ and $Y_2 \triangleq \max(X_1, X_2)$. Using the results of Example 3.4-4 we
graph these pdf’s together with the Normal pdf on the same axes. The curves in Figure 3.4-5
were obtained using the program Microsoft Excel.† The reader may want to duplicate these
curves.


[Plot: pdfs of standard Normal, maximum of two standard Normals, and minimum of two
standard Normals, versus argument.]

Figure 3.4-5 The pdf of min($X_1$, $X_2$) peaks at the left of the origin at −0.5 while the pdf of max($X_1$, $X_2$)
peaks at the right of the origin at 0.5. Note that Var[min($X_1$, $X_2$)] = Var[max($X_1$, $X_2$)] < 1.








3.5 ADDITIONAL EXAMPLES 


To enable the reader to become familiar with the methods discussed in Section 3.4, we 
present here a number of additional examples. 


Example 3.5-1
(magnitude and angle) Consider the RVs

$$V \triangleq g(X, Y) = \sqrt{X^2 + Y^2}$$ (3.5-1)

$$W \triangleq h(X, Y) = \begin{cases} \tan^{-1}\!\left(\dfrac{Y}{X}\right), & X > 0,\\[6pt] \tan^{-1}\!\left(\dfrac{Y}{X}\right) + \pi, & X < 0. \end{cases}$$ (3.5-2a)

The RV V is called the magnitude or envelope while W is called the phase. Equation 3.5-2a
has been written in this form because we seek a solution for w over a 2π interval and the
inverse function $\tan^{-1}(y/x)$ has range (−π/2, π/2) (i.e., its principal value).


†Excel is available with Microsoft Office. The instructions for using Excel are available with the program.
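To find the roots of

$$v = \sqrt{x^2 + y^2}, \qquad w = \begin{cases} \tan^{-1}\!\left(\dfrac{y}{x}\right), & x \ge 0,\\[6pt] \tan^{-1}\!\left(\dfrac{y}{x}\right) + \pi, & x < 0, \end{cases}$$ (3.5-2b)

we observe that for x ≥ 0, we have −π/2 ≤ w ≤ π/2 and cos w ≥ 0. Similarly, for x < 0,
π/2 < w < 3π/2 and cos w < 0. Hence the only solution to Equation 3.5-2b is

$$x = v\cos w \triangleq \phi(v, w)$$
$$y = v\sin w \triangleq \psi(v, w).$$

The Jacobian $\tilde{J}$ is given by

$$\tilde{J} = \frac{\partial(\phi, \psi)}{\partial(v, w)} = \begin{vmatrix} \cos w & -v\sin w\\ \sin w & v\cos w \end{vmatrix} = v.$$

Hence the solution is, from Equation 3.4-11,

$$f_{VW}(v, w) = v\,f_{XY}(v\cos w, v\sin w).$$ (3.5-3)

Suppose that X and Y are i.i.d. and distributed as N(0, σ²), that is,

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\,e^{-[(x^2 + y^2)/2\sigma^2]}.$$

Then from Equation 3.5-3

$$f_{VW}(v, w) = \begin{cases} \left(\dfrac{v}{\sigma^2}e^{-v^2/2\sigma^2}\right)\dfrac{1}{2\pi}, & v > 0,\ -\dfrac{\pi}{2} \le w < \dfrac{3\pi}{2},\\[6pt] 0, & \text{otherwise} \end{cases}$$ (3.5-4)

$$= f_V(v)f_W(w).$$

Thus, V and W are independent random variables. The envelope V has a Rayleigh pdf and
the phase W is uniform over a 2π interval.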


Example 3.5-2
(magnitude and ratio) Consider now a modification of the previous problem. Let
$V \triangleq \sqrt{X^2 + Y^2}$ and $W \triangleq Y/X$. Then with $g(x, y) = \sqrt{x^2 + y^2}$ and h(x, y) = y/x, the equations

$$v - g(x, y) = 0$$
$$w - h(x, y) = 0$$

have two solutions:

$$x_1 = v(1 + w^2)^{-1/2}, \qquad y_1 = wx_1$$
$$x_2 = -v(1 + w^2)^{-1/2}, \qquad y_2 = wx_2$$

for −∞ < w < ∞ and v > 0, and no real solutions for v < 0.

A direct evaluation yields $|J_1| = |J_2| = (1 + w^2)/v$. Hence

$$f_{VW}(v, w) = \frac{v}{1 + w^2}\left[f_{XY}(x_1, y_1) + f_{XY}(x_2, y_2)\right].$$

With $f_{XY}(x, y)$ given by

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\exp[-(x^2 + y^2)/2\sigma^2],$$

we obtain

$$f_{VW}(v, w) = \frac{v}{\sigma^2}\,e^{-v^2/2\sigma^2}u(v) \cdot \frac{1/\pi}{1 + w^2} = f_V(v)f_W(w).$$

Thus, the random variables V, W are independent, with V Rayleigh distributed as in
Example 3.5-1, and W Cauchy distributed.
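The factorization can be probed by simulation. The rough Monte Carlo sketch below (fixed seed, standard library only) estimates the joint probability P[V ≤ v, W ≤ w] and compares it with the product of the Rayleigh CDF $1 - e^{-v^2/2\sigma^2}$ and the Cauchy CDF $\frac{1}{2} + \frac{1}{\pi}\tan^{-1}w$, both obtained by integrating the marginal pdfs above. The test point (v, w) = (1.2, 0.5), σ = 1, and the trial count are arbitrary illustrative choices.

```python
import math
import random

def empirical_joint_cdf(v, w, sigma=1.0, trials=100_000, seed=1):
    """Estimate P[sqrt(X^2 + Y^2) <= v, Y/X <= w] for i.i.d. N(0, sigma^2) X, Y."""
    rng = random.Random(seed)          # fixed seed for repeatability
    hits = 0
    for _ in range(trials):
        x = rng.gauss(0.0, sigma)
        y = rng.gauss(0.0, sigma)
        if x == 0.0:                   # zero-probability event; skipped to avoid division by zero
            continue
        if math.hypot(x, y) <= v and y / x <= w:
            hits += 1
    return hits / trials

def rayleigh_cdf(v, sigma=1.0):
    return 1.0 - math.exp(-v * v / (2.0 * sigma * sigma)) if v > 0 else 0.0

def cauchy_cdf(w):
    return 0.5 + math.atan(w) / math.pi

v, w = 1.2, 0.5   # illustrative test point
print(empirical_joint_cdf(v, w), rayleigh_cdf(v) * cauchy_cdf(w))
```

Agreement at a single point does not establish independence, but repeating the comparison over a grid of (v, w) values gives a useful check on the derivation.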


Example 3.5-3
(rotation of coordinates) Let θ be a prescribed angle and consider the rotational transformation

$$V \triangleq X\cos\theta + Y\sin\theta$$
$$W \triangleq X\sin\theta - Y\cos\theta$$ (3.5-5)

with X and Y i.i.d. Gaussian,

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\,e^{-[(x^2 + y^2)/2\sigma^2]}.$$

The only solution to

$$v = x\cos\theta + y\sin\theta$$
$$w = x\sin\theta - y\cos\theta$$

is

$$x = v\cos\theta + w\sin\theta$$
$$y = v\sin\theta - w\cos\theta.$$

The Jacobian $\tilde{J}$ is

$$\begin{vmatrix} \partial x/\partial v & \partial x/\partial w\\ \partial y/\partial v & \partial y/\partial w \end{vmatrix}
= \begin{vmatrix} \cos\theta & \sin\theta\\ \sin\theta & -\cos\theta \end{vmatrix} = -1.$$

Hence

$$f_{VW}(v, w) = \frac{1}{2\pi\sigma^2}\,e^{-[(v^2 + w^2)/2\sigma^2]}.$$

Thus, under the rotational transformation V = g(X, Y), W = h(X, Y) given in Equation
3.5-5, V and W are i.i.d. Gaussian RVs just like X and Y. If X and Y are Gaussian
but not independent RVs, it is still possible to find a transformation† so that V, W will be
independent Gaussians if the joint pdf of X, Y is Gaussian (Normal).


Example 3.5-4
Consider again the problem of solving for the pdf of $Z = \sqrt{X^2 + Y^2}$ as in Example 3.3-11.
This time we shall use Equation 3.4-11 to somewhat indirectly compute $f_Z(z)$. First we
note that $Z = \sqrt{X^2 + Y^2}$ is one function of two RVs while Equation 3.4-11 applies to two
functions of two RVs. To convert from one kind of problem to the other, we introduce an
auxiliary variable $W \triangleq X$. Then

$$Z \triangleq g(X, Y) = \sqrt{X^2 + Y^2}$$
$$W \triangleq h(X, Y) = X.$$

The equations

$$z - g(x, y) = 0$$
$$w - h(x, y) = 0$$

have two real roots for |w| < z, namely

$$x_1 = w, \qquad x_2 = w,$$
$$y_1 = \sqrt{z^2 - w^2}, \qquad y_2 = -\sqrt{z^2 - w^2}.$$

At both roots, $|\tilde{J}|$ has the same value:

$$|\tilde{J}_1| = |\tilde{J}_2| = \frac{z}{\sqrt{z^2 - w^2}}.$$

Hence a direct application of Equation 3.4-11 yields

$$f_{ZW}(z, w) = \frac{z}{\sqrt{z^2 - w^2}}\left[f_{XY}(x_1, y_1) + f_{XY}(x_2, y_2)\right].$$

Now assume that

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\,e^{-[(x^2 + y^2)/2\sigma^2]}.$$

Then, since in this case $f_{XY}(x, y) = f_{XY}(x, -y)$, we obtain

$$f_{ZW}(z, w) = \begin{cases} \dfrac{1}{\pi\sigma^2}\,\dfrac{z}{\sqrt{z^2 - w^2}}\,e^{-z^2/2\sigma^2}, & z > 0,\ |w| < z,\\[6pt] 0, & \text{otherwise.} \end{cases}$$


†See Chapter 5 on random vectors.


Figure 3.5-1 Trigonometric transformation w = z sin θ.


However, we don’t really want $f_{ZW}(z, w)$, but only the marginal pdf $f_Z(z)$. To obtain this,
we use Equation 2.6-47 (with x replaced by z and y replaced by w). This gives

$$f_Z(z) = \int_{-\infty}^{\infty} f_{ZW}(z, w)\,dw
= \frac{z}{\sigma^2}\,e^{-z^2/2\sigma^2}\left(\frac{2}{\pi}\int_0^{z} \frac{dw}{\sqrt{z^2 - w^2}}\right)u(z).$$

The term in parentheses has value unity. To see this consider the triangle in
Figure 3.5-1 and let $w \triangleq z\sin\theta$. Then dw = z cos θ dθ and $[z^2 - w^2]^{1/2} = z\cos\theta$ and
the term in parentheses becomes

$$\frac{2}{\pi}\int_0^{z} \frac{dw}{\sqrt{z^2 - w^2}} = \frac{2}{\pi}\int_0^{\pi/2} d\theta = 1.$$

Hence

$$f_Z(z) = \frac{z}{\sigma^2}\,e^{-z^2/2\sigma^2}u(z),$$

which is the same result as obtained in Equation 3.3-35, obtained there by a different
method.


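Example 3.5-5
(sum and difference again) Finally, let us return to the problem considered in Example 3.4-1:

$$V \triangleq X + Y$$
$$W \triangleq X - Y.$$

The only root to

$$v - (x + y) = 0$$
$$w - (x - y) = 0$$

is

$$x = \frac{v + w}{2}, \qquad y = \frac{v - w}{2}$$

and $|\tilde{J}| = 1/|J| = \frac{1}{2}$. Hence

$$f_{VW}(v, w) = \frac{1}{2}\,f_{XY}\!\left(\frac{v + w}{2}, \frac{v - w}{2}\right).$$

We verify in passing that

$$f_V(v) = \int_{-\infty}^{\infty} f_{VW}(v, w)\,dw
= \int_{-\infty}^{\infty} \frac{1}{2}\,f_{XY}\!\left(\frac{v + w}{2}, \frac{v - w}{2}\right) dw, \qquad \text{with } x \triangleq \frac{v + w}{2},$$

$$= \int_{-\infty}^{\infty} f_{XY}(x, v - x)\,dx.$$

This important result on the sum of two RVs was derived in Section 3.3 (Equation 3.3-12)
by different means.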








SUMMARY 


The material in this chapter discussed functions of random variables, a subject of great 
significance in applied science and engineering and fundamental to the study of random 
processes. The basic problem dealt with computing the probability law of an output random 
variable Y produced by a system transformation g operating on an input random variable 
X (i.e., Y = g(X)). The problem was then extended to two input random variables X, Y 
being operated upon by system transformations g and h to produce two output random 
variables V = g(X,Y) and W = h(X,Y). Then the problem is to compute the joint pdf 
(PMF) of V, W from the joint pdf (PMF) of X, Y. 

We showed how most problems involving functions of RVs could be computed in at 
least two ways: 


1. the so-called indirect approach through the CDF; and 
2. directly through the use of a “turn-the-crank” direct method. 


A number of important problems involving transformations of random variables were worked 
out including computing the pdf (and PMF) of the sum of two random variables, a problem 
which has numerous applications in science and engineering where unwanted additive noise 
contaminates a desired signal or measurement. For example, the so-called “signal and addi- 
tive noise problem” is a seminal issue in communications engineering. 

Later, when we extend the analysis of the sum of two independent random variables to 
the sum of n independent random variables, we will begin to observe that the CDF of the 
sum starts to “look like” the CDF of a Normal random variable. This fundamental result, 
that is, convergence to the CDF of the Normal, is called the Central Limit Theorem, and is 
discussed in Chapter 4. 

Finally we considered how to compute the pdf of two ordered random variables. We
found we could do this using the powerful so-called direct method for computing distributions
of RVs functionally related to other RVs. Later, in Chapter 5 on random vectors, we will
discuss transformations involving n ordered RVs. Ordered random variables appear in a
branch of statistics called nonparametric statistics and often yield results that are independent
of underlying distributions. In this sense, ordered random variables yield a certain
level of robustness to expressions derived about them.


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 


reading. ) 


3.1 


3.2 


3.3 


3.4 


Let X have CDF Fx (x) and consider Y = aX + b, where a < 0. Show that if X is 


not a continuous RV, then Equation 3.2-3 should be modified to

$$F_Y(y) = 1 - F_X\!\left(\frac{y - b}{a}\right) + P\!\left[X = \frac{y - b}{a}\right]
= 1 - F_X\!\left(\frac{y - b}{a}\right) + P_X\!\left(\frac{y - b}{a}\right).$$


Let Y be a function of the RV X as follows: 


yafx, x20, 
TX, X<0. 


Compute $f_Y(y)$ in terms of $f_X(x)$. Assume that X:N(0, 1), that is,

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2}.$$


Let X be a random variable uniform over


(a) (−π/2, π/2). Compute the pdf of Y = tan X.
(b) (0, 1). Compute the pdf of $Y = e^X$.


Let Y be a function of the random variable X as follows: 


$$Y \triangleq \begin{cases} X, & X \ge 0,\\ 2X^2, & X < 0. \end{cases}$$


Compute pdf $f_Y(y)$ in terms of pdf $f_X(x)$. Let $f_X(x)$ be given by

$$f_X(x) = \frac{1}{2\sqrt{\pi}}\,e^{-\frac{1}{4}x^2},$$

that is, X:N(0, 2).
3.5 


3.6 


*3.7 


*3.8 


*3.9 


3.10 


3.11 


Let X have pdf

$$f_X(x) = \alpha e^{-\alpha x}u(x).$$

Compute the pdf of (a) $Y = X^3$; (b) Y = 3X + 2.

Let X be a Laplacian random variable with pdf

$$f_X(x) = \frac{1}{2}\,e^{-|x|}, \qquad -\infty < x < +\infty.$$

Let Y = g(X), where g(·) is the nonlinear function given as the saturable limiter

$$g(x) \triangleq \begin{cases} -1, & x < -1,\\ x, & -1 \le x \le +1,\\ +1, & x > +1. \end{cases}$$

Find the distribution function $F_Y(y)$.

In medical imaging such as computer tomography, the relation between detector
readings y and body absorptivity x follows a $y = e^x$ law. Let X:N(μ, σ²); compute
the pdf of Y. This distribution of Y is called lognormal. The lognormal random
variable has been found quite useful for modeling failure rates of semiconductors,
among many other uses.


In the previous problem you found that if X:N(μ, σ²), then $Y \triangleq \exp X$ has a
lognormal density or pdf

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma y}\exp\!\left[-\frac{(\ln y - \mu)^2}{2\sigma^2}\right]u(y).$$

(a) Sketch the lognormal density for a couple of values of μ and σ.

(b) What is the distribution function of the lognormal random variable Y? Express
your answer in terms of our erf function. Hint: There are two possible approaches.
You can use the method of substitution to integrate the above density, or you
can find the distribution function of Y directly as a transformation of random
variable problem.


In homomorphic image processing, images are enhanced by applying nonlinear trans- 

formations to the image functions. Assume that the image function is modeled as 

RV X and the enhanced image Y is Y = ln X. Note that X cannot assume negative 
1 


values. Compute the pdf of Y if X has an exponential density fx(x) = e7 37 u(x). 
Assume that X:N(0,1) and let Y be defined by 


$$Y \triangleq \begin{cases} \sqrt{X}, & X \ge 0,\\ 0, & X < 0. \end{cases}$$


Compute the pdf of Y. 


(a) Let X:N(0, 1) and let $Y \triangleq g(X)$, where the function g is shown in Figure P3.11.
Use the indirect approach to compute $F_Y(y)$ and $f_Y(y)$ from $f_X(x)$. (b) Compute
$f_Y(y)$ from Equation 3.2-23. Why can’t Equation 3.2-23 be used to compute $f_Y(y)$
at y = 0, 1?


g(x) 


1 2 x 


Figure P3.11 


3.12 Let $X: U[0, 2]$. Compute the pdf of Y if $Y = g(X)$, where the function g is plotted in Figure P3.12.

[Figure P3.12: plot of g(x) versus x.]

3.13 Let $X: U[0, 2]$. Compute the pdf of Y if $Y = g(X)$ with the function g as shown in Figure P3.13.

[Figure P3.13: plot of g(x) versus x.]
3.14 Let the RV $X: N(0, 1)$, that is, X is Gaussian with pdf

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad -\infty < x < +\infty.$$

Let $Y = g(X)$, where g is the nonlinear function given as

$$g(x) \triangleq \begin{cases} -1, & x < -1, \\ x, & -1 \le x \le 1, \\ 1, & x > 1. \end{cases}$$

It is called a saturable limiter function. (a) Sketch g(x); (b) find $F_Y(y)$; (c) find and sketch $f_Y(y)$.

3.15 Let $X \sim N(\mu, \sigma^2)$ and let $Y = aX + b$. Show that $Y \sim N(a\mu + b, a^2\sigma^2)$ and find the values of a and b so that $Y \sim N(0, 1)$.

3.16 Let $Y \triangleq \sec X$. Compute $f_Y(y)$ in terms of $f_X(x)$. What is $f_Y(y)$ when $f_X(x)$ is uniform in $(-\pi, \pi]$?

3.17 Consider two random variables X and Y with the joint pdf $f_{X,Y}(x, y)$. Determine the pdf of $Z = XY$. Repeat for the case when X and Y are independent uniform random variables over (0, 1).

3.18 Let X and Y be independent and identically distributed exponential RVs with

$$f_X(x) = f_Y(x) = a e^{-ax} u(x).$$

Compute the pdf of $Z \triangleq Y - X$.

3.19 Let random variables X and Y be described by the given joint pdf $f_{X,Y}(x, y)$. Define new random variables as

$$V \triangleq X + Y \quad \text{and} \quad W \triangleq 2X - Y.$$

(a) Find the joint pdf $f_{V,W}(v, w)$ in terms of the joint pdf $f_{X,Y}(x, y)$.

(b) Show, using the results of part (a) or in any other valid way, that under suitable conditions

$$f_Z(z) = \int_{-\infty}^{+\infty} f_X(x) f_Y(z - x)\,dx,$$

for $Z \triangleq X + Y$. What are the suitable conditions?

3.20 Repeat Example 3.2-11 for $f_X(x) = e^{-x} u(x)$.

3.21 Repeat Example 3.2-12 for $f_X(x) = e^{-x} u(x)$.

3.22 The objective is to generate numbers from the pdf shown in Figure P3.22. All that is available is a random number generator that generates numbers uniformly distributed in (0, 1). Explain what procedure you would use to meet the objective.

[Figure P3.22: plot of the pdf versus x, with x-axis ticks at -1 and 1.]

3.23 It is desired to generate zero-mean Gaussian numbers. All that is available is a random number generator that generates numbers uniformly distributed on (0, 1). It has been suggested Gaussian numbers might be generated by adding 12 uniformly distributed numbers and subtracting 6 from the sum. Write a program in which you use the procedure to generate 10,000 numbers and plot a histogram of your result. A histogram is a bar graph that has bins along the x-axis and number of points in the bin along the y-axis. Choose 200 bins of width 0.1 to span the range from -10 to 10. In what region of the histogram does the data look most Gaussian? Where does it look least Gaussian? Give an explanation of why this approach works.
*3.24 Random number generators on computers often provide a basic uniform random variable $X: U[0, 1]$. This problem explores how to get more general distributions by transformation of such an X.

(a) Consider the Laplacian density $f_Y(y) = \frac{c}{2} \exp(-c|y|)$, $-\infty < y < +\infty$, with parameter $c > 0$, that often arises in image processing problems. Find the corresponding Laplacian distribution function $F_Y(y)$ for $-\infty < y < +\infty$.

(b) Consider the transformation

$$z = g(x) = F_Y^{-1}(x),$$

using the distribution function you found in part (a). Note that $F_Y^{-1}$ denotes an inverse function. Show that the resulting random variable $Z = g(X)$ will have the Laplacian distribution with parameter c if $X: U[0, 1]$. Note also that this general result does not depend on the Laplacian distribution function other than that it has an inverse.

(c) What are the limitations of this transform approach? Specifically, will it work with mixed random variables? Will it work with distribution functions that have flat regions? Will it work with discrete random variables?

3.25 In Problem 3.18 compute the pdf of $|Z|$.

3.26 Let X and Y be independent, continuous RVs. Let $Z = \min(X, Y)$. Compute $F_Z(z)$ and $f_Z(z)$. Sketch the result if X and Y are distributed as U(0, 1). Repeat for the exponential density $f_X(x) = f_Y(x) = a \exp[-ax] \cdot u(x)$.

3.27 Let X and Y be two random variables with the joint pdf $f_{XY}(x, y)$ and joint CDF $F_{XY}(x, y)$. Let $Z = \max(X, Y)$.

(a) Find the CDF of Z.

(b) Find the pdf of Z if X and Y are independent.

Discuss if X and Y are independent and identical exponential variates with mean $\mu$.
3.28


3.29 


3.30 


3.31 


3.32 


3.33 


Let X and Y be two random variables with a joint pdf $f_{X,Y}(x, y)$. Let $R = \sqrt{X^2 + Y^2}$, $\Theta = \tan^{-1}(Y/X)$. Find $f_{R\Theta}(r, \theta)$ in terms of $f_{XY}(x, y)$.

Let X, Y and Z be independent standard normal random variables. Let $W = (X^2 + Y^2 + Z^2)^{1/2}$. Find the pdf of W.

Let $X_1, X_2, \ldots, X_n$ be n i.i.d. exponential random variables with $f_{X_i}(x) = e^{-x} u(x)$. Compute an explicit expression for the pdf of $Z_n = \max(X_1, X_2, \ldots, X_n)$. Sketch the pdf for n = 3.

Let $X_1, X_2, \ldots, X_n$ be n i.i.d. exponential random variables with $f_{X_i}(x) = e^{-x} u(x)$. Compute an explicit expression for the pdf of $Z_n = \min(X_1, X_2, \ldots, X_n)$. Sketch the pdf for n = 3.

Let X, Y be i.i.d. as U(-1, 1). Compute and sketch the pdf of Z for the system shown in Figure P3.31. The square-root operation is valid only for positive numbers. Otherwise the output of the device is zero.

[Figure P3.31: A square-root device.]

The length of time, Z, an airplane can fly is given by $Z = aX$, where X is the amount of fuel in its tank and $a > 0$ is a constant of proportionality. Suppose a plane has two independent fuel tanks so that when one gets empty the other switches on automatically. Because of lax maintenance a plane takes off with neither of its fuel tanks checked. Let $X_1$ be the fuel in the first tank and $X_2$ the fuel in the second tank. Let $X_1$ and $X_2$ be modeled as uniform i.i.d. RVs with pdf $f_{X_1}(x) = f_{X_2}(x) = \frac{1}{b}[u(x) - u(x - b)]$. Compute the pdf of Z, the maximum flying time of the plane. If b = 100, say in liters, and a = 1 hour/10 liters, what is the probability that the plane will fly at least five hours?

Let X and Y be two independent Poisson RVs with PMFs

$$P_X(k) = \frac{1}{k!}\, e^{-2}\, 2^k u(k) \quad \text{and} \tag{3.5-6}$$

$$P_Y(k) = \frac{1}{k!}\, e^{-3}\, 3^k u(k), \quad \text{respectively.} \tag{3.5-7}$$

Compute $P[Z < 4]$, where $Z \triangleq X + Y$. Hint: $\sum_{j=0}^{n} \binom{n}{j} a^j b^{n-j} = (a + b)^n$.
j=0 
3.34 Given two random variables X and Y that are independent and uniformly distributed as U(0, 1):

(a) Find the joint pdf $f_{U,V}$ of random variables U and V defined as:

$$U \triangleq \tfrac{1}{2}(X + Y) \quad \text{and} \quad V \triangleq \tfrac{1}{2}(X - Y).$$

(b) Sketch the support of $f_{U,V}$ in the (u, v) plane. Remember that the support of a function is the subset of its domain for which the function takes on nonzero values.

3.35 Let X and Y be independent random variables with pdf $f_X(x) = e^{-x}$, $x > 0$ and zero otherwise, and $f_Y(y) = e^{-y}$, $y > 0$ and zero otherwise. Compute (a) the pdf of $Z = X + Y$; (b) the pdf of $Z = X - Y$.

3.36 Compute the joint pdf $f_{ZW}(z, w)$ if

$$Z \triangleq X^2 + Y^2$$
$$W \triangleq X$$

when

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-[(x^2 + y^2)/2\sigma^2]}, \quad -\infty < x < \infty,\ -\infty < y < \infty.$$

Then compute $f_Z(z)$ from your results.

*3.37 Consider the transformation

$$Z = aX + bY$$
$$W = cX + dY.$$

Let

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2\sqrt{1 - \rho^2}}\, e^{-Q(x, y)},$$

where

$$Q(x, y) = \frac{1}{2\sigma^2(1 - \rho^2)}\left[x^2 - 2\rho xy + y^2\right].$$

What combination of coefficients a, b, c, d will enable Z, W to be independent Gaussian RVs?

3.38 Let

$$f_{XY}(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left[-\frac{x^2 - 2\rho xy + y^2}{2(1 - \rho^2)}\right].$$

Compute the joint pdf $f_{VW}(v, w)$ of

$$V = \tfrac{1}{2}(X^2 + Y^2) \tag{3.5-8}$$
$$W = \tfrac{1}{2}(X^2 - Y^2). \tag{3.5-9}$$


3.39 Derive Equation 3.4-4 by the direct method, that is, using Equations 3.4-11 or 3.4-12. 
3.40 Consider the transformation 


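$$Z = X\cos\theta + Y\sin\theta \tag{3.5-10}$$
$$W = X\sin\theta - Y\cos\theta. \tag{3.5-11}$$

Compute the joint pdf $f_{ZW}(z, w)$ in terms of $f_{XY}(x, y)$ if

$$f_{XY}(x, y) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)}, \quad -\infty < x < \infty,\ -\infty < y < \infty.$$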


(It may be helpful to note that this transformation is a rotation by +0 followed by 
a negation on W.) 
3.41 Compute the joint pdf of 


$$Z \triangleq X^2 + Y^2$$


w Sey 


when

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-[(x^2 + y^2)/2\sigma^2]}.$$


3.42 If X and Y are independent random variables which are uniformly distributed over 
(0, 1), find the joint pdf and hence the marginal pdf of U = X +Y, V = X -Y. 

3.43 Consider the input-output view mentioned in Section 3.1. Let the underlying exper- 
iment be observations on an RV X, which is the input to a system that generates 
an output Y = g(X). 


(a) What is the range of Y? 

(b) What are reasonable probability spaces for X and Y? 

(c) What subset of $R^1$ consists of the event $\{Y \le y\}$?

(d) What is the inverse image under Y of the event $(-\infty, y)$ if $Y = 2X + 3$?


3.44 In the diagram shown in Figure P3.44, it is attempted to deliver the signal X from points a to b. The two links L1 and L2 operate independently, with times-to-failure $T_1$, $T_2$, respectively, which are exponentially and identically distributed with rate $\lambda$ (> 0). Set Y = 0 if both links fail. Denote the output by Y and compute $F_Y(y, t)$, the CDF of Y at time t. Show for any fixed t that $F_Y(\infty, t) = 1$.

[Figure P3.44: Parallel links.]
4 Expectation and Moments


4.1 EXPECTED VALUE OF A RANDOM VARIABLE 


It is often desirable to summarize certain properties of an RV and its probability law by a 
few numbers. Such numbers are furnished to us by the various averages, or expectations of 
an RV; the term moments is often used to describe a broad class of averages, and we shall 
use it later. 

We are all familiar with the notion of the average of a set of numbers, for example, the 
average class grade for an exam, the average height and weight of children at age five, the 
average lifetime of men versus women, and the like. Basically, we compute the average of a 
set of numbers $x_1, x_2, \ldots, x_N$ as follows:

$$\mu_s \triangleq \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{4.1-1}$$

where the subscript s is a reminder that $\mu_s$ is the average of a set.

The average $\mu_s$ of a set of numbers $x_1, x_2, \ldots, x_N$ can be viewed as the “center of gravity” of the set. More precisely, the average is the number that is simultaneously closest to all the numbers in the set in the sense that the sum of the squared distances from it to all the points in the set is smallest. To demonstrate this we need only ask what number z minimizes the summed distance-square $D^2$ to all the points. Thus with

$$D^2 \triangleq \sum_{i=1}^{N} (z - x_i)^2,$$

the minimum occurs when $dD^2/dz = 0$ or

$$\frac{dD^2}{dz} = 2Nz - 2\sum_{i=1}^{N} x_i = 0,$$

which implies that

$$z = \mu_s = \frac{1}{N} \sum_{i=1}^{N} x_i.$$

Note that each number in Equation 4.1-1 is given the same weight (i.e., each $x_i$ is multiplied by the same factor 1/N). If, for some reason, we wish to give some numbers more weight than others when computing the average, we then obtain a weighted average. However, we won’t pursue the idea of a weighted average any further in this chapter.

Although the average as given in Equation 4.1-1 gives us the “most likely” value or the “center of gravity” of the set, it does not tell us how much the numbers spread or deviate from the average. For example, the sets of numbers $S_1 = \{0.9, 0.98, 0.95, 1.1, 1.02, 1.05\}$ and $S_2 = \{0.2, -3, 1.8, 2, 4, 1\}$ have the same average but the spread of the numbers in $S_2$ is much greater than that of $S_1$. An average that summarizes this spread is the standard deviation of the set, $\sigma_s$, computed from

$$\sigma_s \triangleq \left[\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_s)^2\right]^{1/2}. \tag{4.1-2}$$


Equations 4.1-1 and 4.1-2, important as they are, fall far short of disclosing the usefulness 
of averages. To exploit the full range of applications of averages, we must develop a calculus 
of averages from probability theory. 
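
The two set averages above are easy to check numerically. The following Python sketch (the function names are ours, chosen only for illustration) computes $\mu_s$ and $\sigma_s$ of Equations 4.1-1 and 4.1-2 for the sets $S_1$ and $S_2$, and evaluates the summed distance-square $D^2$ at the average and at two nearby trial points.

```python
import math

def average(xs):
    """Set average, Equation 4.1-1."""
    return sum(xs) / len(xs)

def std_dev(xs):
    """Set standard deviation, Equation 4.1-2 (divides by N, not N - 1)."""
    mu = average(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def summed_square_distance(z, xs):
    """D^2: sum of (z - x_i)^2 over the set."""
    return sum((z - x) ** 2 for x in xs)

S1 = [0.9, 0.98, 0.95, 1.1, 1.02, 1.05]
S2 = [0.2, -3, 1.8, 2, 4, 1]

mu1, mu2 = average(S1), average(S2)
sigma1, sigma2 = std_dev(S1), std_dev(S2)
print(f"S1: average = {mu1:.4f}, standard deviation = {sigma1:.4f}")
print(f"S2: average = {mu2:.4f}, standard deviation = {sigma2:.4f}")

# D^2 evaluated at the average and at two nearby trial points.
for z in (mu2 - 0.5, mu2, mu2 + 0.5):
    print(f"z = {z:.2f}: D^2 = {summed_square_distance(z, S2):.4f}")
```

Both sets have average 1, while the standard deviation of $S_2$ is far larger than that of $S_1$; the value of $D^2$ is smaller at the average than at either trial point.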

Consider a probability space $(\Omega, \mathcal{F}, P)$ associated with an experiment $\mathcal{H}$ and a discrete RV X. Associated with each outcome $\zeta_i$ of $\mathcal{H}$, there is a value $X(\zeta_i) \triangleq x_i$, which the RV X takes on. Let $x_1, x_2, \ldots, x_M$ be the M distinct values that X can take. Now assume that $\mathcal{H}$ is repeated N times and let $x^{(k)}$ be the observed outcome at the kth trial. Note that $x^{(k)}$ must assume one of the numbers $x_1, \ldots, x_M$. Suppose that in the N trials $x_1$ occurs $n_1$ times, $x_2$ occurs $n_2$ times, and so forth. Then for N large, we can estimate the average value $\mu_X$ of X from the formula

$$\mu_X \simeq \frac{1}{N} \sum_{k=1}^{N} x^{(k)} \tag{4.1-3}$$

$$= \frac{1}{N} \sum_{i=1}^{M} n_i x_i$$

$$= \sum_{i=1}^{M} x_i \left(\frac{n_i}{N}\right) \tag{4.1-4}$$

$$\simeq \sum_{i=1}^{M} x_i P[X = x_i]. \tag{4.1-5}$$


Example 4.1-1
(loaded dice) We observe 17 tosses of a loaded die. Here N = 17, M = 6 (the six faces of the die). The observations are {1,3,3,1,2,1,3,2,1,1,2,4,1,1,5,3,6}. Let P[i] denote the probability of observing the face with the number i on it. Then from the observational data we get

$$P[1] \simeq 7/17; \quad P[2] \simeq 3/17; \quad P[3] \simeq 4/17; \quad P[4] \simeq 1/17; \quad P[5] \simeq 1/17; \quad P[6] \simeq 1/17.$$

These estimates of the “true” probabilities are quite unreliable, however. To get more reliable data we would have to greatly increase the number of tosses. We might ask what are the “true” probabilities anyway. One answer might be that the Laws of Nature have imbued the die with an inherent set of probabilities that must be determined by experimentation. Another view is that the true probabilities are the ratios $P[i] = n_i/N$ you get when N becomes arbitrarily large. However, what is meant by arbitrarily large? For any finite value of N the estimated probabilities will always change as we increase N. These conundrums are mostly resolved by statistics, discussed in some detail in Chapters 6 and 7.
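
The bookkeeping in this example can be reproduced in a few lines. The Python sketch below (variable names are ours) counts the 17 observed tosses, forms the relative frequencies $n_i/N$, and evaluates the average both trial by trial, as in Equation 4.1-3, and value by value, as in Equation 4.1-4. Exact fractions are used so that the two forms can be compared without rounding.

```python
from collections import Counter
from fractions import Fraction

tosses = [1, 3, 3, 1, 2, 1, 3, 2, 1, 1, 2, 4, 1, 1, 5, 3, 6]
N = len(tosses)

# n_i: number of times face i was observed.
counts = Counter(tosses)

# Relative-frequency estimates n_i / N of the face probabilities.
p_hat = {face: Fraction(counts[face], N) for face in range(1, 7)}

# Equation 4.1-3: average over the N trials.
mean_by_trials = Fraction(sum(tosses), N)

# Equation 4.1-4: average over the M distinct values, weighted by n_i / N.
mean_by_values = sum(face * p for face, p in p_hat.items())

for face in range(1, 7):
    print(f"P[{face}] is estimated as {p_hat[face]}")
print("average by trials:", mean_by_trials)
print("average by values:", mean_by_values)
```

The two averages agree exactly, as they must, since Equation 4.1-4 only regroups the terms of Equation 4.1-3.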





Equation 4.1-5, which follows from the frequency definition of probability, leads us to our 
first definition. 


Definition 4.1-1 The expected or average value of a discrete RV X taking on values $x_i$ with PMF $P_X(x_i)$ is defined by

$$E[X] \triangleq \sum_i x_i P_X(x_i). \tag{4.1-6}$$

As given, the expectation is computed in the probability space generated by the RV. We can also compute the expectation by summing over all points of the discrete sample space, that is, $E[X] = \sum_i X(\zeta_i) P[\{\zeta_i\}]$, where the $\zeta_i$ are the discrete outcome points in the sample space $\Omega$.

A definition that applies to both continuous and discrete RVs is the following: 


Definition 4.1-2 The expected value or mean, if it exists,† of a real RV X with pdf $f_X(x)$ is defined by

$$E[X] \triangleq \int_{-\infty}^{\infty} x f_X(x)\,dx. \tag{4.1-7}$$

Here, as well as in Definition 4.1-1, the expectation can be computed in the original probability space. If the sample description space is not discrete but continuous, for example, an uncountably infinite set of outcomes such as the real line, then $E[X] = \int_{\Omega} X(\zeta) P[\{d\zeta\}]$, where $P[\{d\zeta\}]$ is the probability of the infinitesimal event $\{\zeta < \zeta' \le \zeta + d\zeta\}$.

† The expected value will exist if the integral is absolutely convergent, that is, if $\int_{-\infty}^{\infty} |x| f_X(x)\,dx < \infty$.

The symbols $E[X]$, $\overline{X}$, $\mu_X$, or simply $\mu$ are often used interchangeably for the expected value of X. Consider now a function of an RV, say, $Y = g(X)$. The expected value of Y is, from Equation 4.1-7,

$$E[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy. \tag{4.1-8}$$


However, Equation 4.1-8 requires computing fy(y) from fx(x). If all we want is E[Y], 
is there a way to compute it without first computing fy(y)? The answer is given by 
Theorem 4.1-1 which follows. 


Theorem 4.1-1 The expected value of $Y = g(X)$ can be computed from

$$E[Y] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx, \tag{4.1-9}$$

where g is a measurable (Borel) function.† Equation 4.1-9 is an important result in the theory of probability. A rigorous proof of Equation 4.1-9 requires some knowledge of Lebesgue integration; we offer instead an informal argument below to argue that Equation 4.1-9 is valid.‡

† See definition of a measurable function in Section 3.1.
‡ See Feller [4-1, p. 5] or Davenport [4-2, p. 223].

On the Validity of Equation 4.1-9
Recall from Section 3.2 that if $Y = g(X)$ then for any $y_j$ (Figure 4.1-1)

$$\{y_j < Y \le y_j + \Delta y_j\} = \bigcup_{k=1}^{r_j} \{x_j^{(k)} < X \le x_j^{(k)} + \Delta x_j^{(k)}\}, \tag{4.1-10}$$

where $r_j$ is the number of real roots of the equation $y_j - g(x) = 0$, that is,

$$y_j = g(x_j^{(1)}) = \cdots = g(x_j^{(r_j)}). \tag{4.1-11}$$

The equal sign in Equation 4.1-10 means that the underlying event is the same for both mappings X and Y. Hence the probabilities of the events on either side of the equal sign are equal. The events on the right side of Equation 4.1-10 are disjoint and therefore the probability of the union is the sum of the probabilities of the individual events. Now partition the y-axis into many fine subintervals $y_1, y_2, \ldots, y_j, \ldots$. Then, approximating Equation 4.1-8 with a Riemann† sum and using Equation 2.4-6, we can write‡

$$E[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy$$

$$\simeq \sum_{j=1}^{m} y_j P[y_j < Y \le y_j + \Delta y_j]$$

$$= \sum_{j=1}^{m} \sum_{k=1}^{r_j} g(x_j^{(k)}) P[x_j^{(k)} < X \le x_j^{(k)} + \Delta x_j^{(k)}]. \tag{4.1-12}$$

Figure 4.1-1 Equivalence between the events given in Equation 4.1-10.

The last line of Equation 4.1-12 is obtained with the help of Equations 4.1-10 and 4.1-11. But the points $x_j^{(k)}$ are distinct, so that the cumbersome double indices j and k can be replaced with a single subscript index, say, i. Then Equation 4.1-12 becomes

$$E[Y] \simeq \sum_{i \ge 1} g(x_i) P[x_i < X \le x_i + \Delta x_i],$$

and as $\Delta y, \Delta x \to 0$ we obtain the exact result that

$$E[Y] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx. \tag{4.1-13}$$

Equation 4.1-13 follows from the Riemann sum approximation and Equation 2.4-6; the $x_i$ have been ordered in increasing order $x_1 < x_2 < x_3 < \cdots$.

† Bernhard Riemann (1826-1866). German mathematician who made numerous contributions to the theory of integration.
‡ The argument follows that of Papoulis [4-3, p. 141].

In the special case where X is a discrete RV,

$$E[Y] = \sum_i g(x_i) P_X(x_i). \tag{4.1-14}$$

This result follows immediately from Equation 4.1-13, since the pdf of a discrete RV involves delta functions that have the property given in Equation B.3-1 in Appendix B.
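
A quick numerical check of Theorem 4.1-1 may be helpful, under assumptions of our own choosing: take X uniform on (0, 1) and $g(x) = x^2$, for which the methods of Chapter 3 give $f_Y(y) = 1/(2\sqrt{y})$ on $0 < y < 1$. The Python sketch below evaluates $E[Y]$ both from Equation 4.1-8 and from Equation 4.1-9 with a simple midpoint rule. The helper `midpoint_integral` is illustrative only and is not a library routine.

```python
import math

def midpoint_integral(f, a, b, n=200_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def f_X(x):
    """pdf of X: U(0, 1) on its support."""
    return 1.0

def g(x):
    """The transformation Y = g(X) = X^2."""
    return x * x

def f_Y(y):
    """pdf of Y = X^2 on 0 < y < 1, obtained by transformation of X."""
    return 1.0 / (2.0 * math.sqrt(y))

# Equation 4.1-8: integrate y f_Y(y) over the range of Y.
mean_via_fY = midpoint_integral(lambda y: y * f_Y(y), 0.0, 1.0)

# Equation 4.1-9: integrate g(x) f_X(x) over the range of X.
mean_via_fX = midpoint_integral(lambda x: g(x) * f_X(x), 0.0, 1.0)

print("E[Y] from f_Y:", mean_via_fY)
print("E[Y] from f_X:", mean_via_fX)
```

Both routes give values close to 1/3, but only the first requires that $f_Y(y)$ be computed beforehand.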





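Example 4.1-2
(expected value of Gaussian) Let $X: N(\mu, \sigma^2)$, read “X is distributed as Normal with parameters $\mu$ and $\sigma^2$.” The expected value or mean of X is

$$E[X] = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] dx.$$

Let $z \triangleq (x - \mu)/\sigma$. Then

$$E[X] = \frac{\sigma}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z e^{-z^2/2}\,dz + \mu \left(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2}\,dz\right).$$

The first term is zero because the integrand is odd, and the second term is $\mu$ because the term in parentheses is $P[Z \le \infty]$, which is the certain event for $Z: N(0, 1)$. Hence

$$E[X] = \mu \quad \text{for } X: N(\mu, \sigma^2).$$

Thus, the parameter $\mu$ in $N(\mu, \sigma^2)$ is indeed the expected or mean value of X as claimed in Section 2.4.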


Example 4.1-3
(expected value of Bernoulli RV) Assume that the RV B is Bernoulli distributed taking on value 1 with probability p and 0 with probability $q = 1 - p$. Then the PMF is given as

$$P_B(k) = \begin{cases} p, & \text{when } k = 1, \\ q, & \text{when } k = 0, \\ 0, & \text{else.} \end{cases}$$

The expected value is then given as

$$E[B] = \sum_{k=-\infty}^{+\infty} k P_B(k) = 1 \cdot p + 0 \cdot q = p.$$
Example 4.1-4
(expected value of binomial RV) Assume that the RV K is binomial distributed with PMF $P_K(k) = b(k; n, p)$. Then we calculate the expected value as

$$E[K] = \sum_{k=-\infty}^{+\infty} k P_K(k)$$

$$= \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k}$$

$$= \sum_{k=1}^{n} k\, \frac{n!}{k!(n - k)!}\, p^k (1 - p)^{n-k}$$

$$= \sum_{k'=0}^{n-1} \frac{n!}{k'!(n - 1 - k')!}\, p^{k'+1} (1 - p)^{n-1-k'} \quad \text{with } k' \triangleq k - 1,$$

$$= np \left(\sum_{k'=0}^{n-1} \frac{(n - 1)!}{k'!(n - 1 - k')!}\, p^{k'} (1 - p)^{n-1-k'}\right)$$

$$= np, \quad \text{since the sum in the round brackets is } \sum_{k=0}^{n-1} b(k; n - 1, p) = 1.$$
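
The result $E[K] = np$ can be confirmed by summing the defining series directly. The following Python sketch (function names are ours) evaluates $\sum_k k\, b(k; n, p)$ for two arbitrarily chosen parameter pairs and compares the sum with $np$.

```python
from math import comb

def binomial_pmf(k, n, p):
    """b(k; n, p): probability of k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_mean(n, p):
    """E[K] computed directly from the definition of expected value."""
    return sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

for n, p in [(10, 0.3), (50, 0.01)]:
    print(f"n = {n}, p = {p}: series = {binomial_mean(n, p):.6f}, np = {n * p:.6f}")
```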
k=0 





Example 4.1-5 
(more on multiple lottery tickets) We continue Example 1.9-6 of Chapter 1 on whether 
it is better to buy 50 tickets from a single lottery or 1 ticket each from 50 successive 
lotteries, all independent and with the same fair odds. Here we are interested in the mean 
or expected return in each case. Again each lottery has 100 tickets at $1 each and the 
fair payoff is $100 to the winner. For the single lottery, we remember the odds of winning 
are 50 percent, so the expected payoff is $50. For the 50 plays in separate lotteries, we 
recall that the number of wins K is binomial distributed as b(k; 50, 0.01), so the mean value 
E[K] = np = 50 × 0.01 = 0.5. Since the payoff would be $100K, the average payoff would
be $50, same as in the single lottery. 
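
The two expected payoffs can be computed side by side. The Python sketch below (variable names are ours) uses the numbers of this example: 100 tickets per lottery, a payoff of 100 dollars, and either 50 tickets in one lottery or one ticket in each of 50 independent lotteries. It also evaluates the probability of at least one win in the 50 separate lotteries, which shows that equal means do not imply equal distributions.

```python
from math import comb

payoff = 100.0          # dollars paid to the winner of one lottery
tickets_per_lottery = 100

# Case 1: 50 tickets in a single lottery, so at most one win.
p_win_single = 50 / tickets_per_lottery
expected_single = payoff * p_win_single

# Case 2: one ticket in each of 50 independent lotteries; K is b(k; 50, 0.01).
n, p = 50, 1 / tickets_per_lottery
expected_wins = sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))
expected_multi = payoff * expected_wins

# Probability of winning at least once in case 2.
p_at_least_one = 1 - (1 - p) ** n

print("expected payoff, single lottery:", expected_single)
print("expected payoff, 50 lotteries:  ", expected_multi)
print("P[at least one win in 50 lotteries]:", p_at_least_one)
```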


Example 4.1-6
(expected value of Poisson) Let K be a Poisson RV with parameter $a > 0$. Then

$$E[K] = \sum_{k=0}^{\infty} k\, \frac{e^{-a}}{k!}\, a^k$$

$$= a e^{-a} \sum_{k=1}^{\infty} \frac{a^{k-1}}{(k - 1)!}$$

$$= a e^{-a} \sum_{i=0}^{\infty} \frac{a^i}{i!}$$

$$= a. \tag{4.1-15}$$

Thus, the expected value of a Poisson RV is the parameter a.
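
As a numerical illustration, the series for $E[K]$ can be summed term by term. In the Python sketch below (names and the truncation point `k_max` are our choices), each PMF value is obtained from the previous one by the recursion $P_K(k) = P_K(k-1)\, a/k$, which avoids forming large factorials.

```python
import math

def poisson_mean(a, k_max=200):
    """Sum k * P_K(k) for k = 1, ..., k_max, with P_K(k) = exp(-a) a^k / k!."""
    term = math.exp(-a)      # P_K(0)
    mean = 0.0
    for k in range(1, k_max + 1):
        term *= a / k        # P_K(k) obtained from P_K(k - 1)
        mean += k * term
    return mean

for a in (0.5, 3.5, 10.0):
    print(f"a = {a}: truncated series = {poisson_mean(a):.10f}")
```

For moderate values of a the truncated sum reproduces the parameter a to within floating-point accuracy.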


Linearity of expectation. When we regard mathematical expectation E as an operator, it is relatively easy to see it is a linear operator. For any X, consider

$$E\left[\sum_{i=1}^{N} g_i(X)\right] = \int_{-\infty}^{+\infty} \left(\sum_{i=1}^{N} g_i(x)\right) f_X(x)\,dx \tag{4.1-16}$$

$$= \sum_{i=1}^{N} \int_{-\infty}^{+\infty} g_i(x) f_X(x)\,dx \tag{4.1-17}$$

$$= \sum_{i=1}^{N} E[g_i(X)] \tag{4.1-18}$$

provided that these exist. The expectation operator E is also linear for the sum of two RVs:

$$E[X + Y] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x + y) f_{X,Y}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x f_{X,Y}(x, y)\,dx\,dy + \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} y f_{X,Y}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{+\infty} x \left(\int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dy\right) dx + \int_{-\infty}^{+\infty} y \left(\int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dx\right) dy$$

$$= \int_{-\infty}^{+\infty} x f_X(x)\,dx + \int_{-\infty}^{+\infty} y f_Y(y)\,dy$$

$$= E[X] + E[Y].$$

The reader will notice that this result can readily be extended to the sum of N RVs $X_1, X_2, \ldots, X_N$. Thus,

$$E\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} E[X_i]. \tag{4.1-19}$$

Note that independence is not required. We can summarize both linearity results by saying that the mathematical expectation operator E distributes over a sum of RVs.
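
That independence is not needed can be seen on a small discrete example of our own construction. In the Python sketch below, X takes the values 0, 1, 2 with a hypothetical PMF, and $Y = X^2$ is completely determined by X, so the two RVs are as dependent as they can be. The discrete analog of the double integral above still gives $E[X + Y] = E[X] + E[Y]$, while the product $E[XY]$ differs from $E[X]E[Y]$.

```python
from fractions import Fraction

# Hypothetical PMF of X (illustration only); Y = X^2 is a function of X.
P_X = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}

# Joint PMF of (X, Y): all probability mass sits on the points (x, x^2).
joint = {(x, x * x): p for x, p in P_X.items()}

def expect(g, joint_pmf):
    """E[g(X, Y)] as a sum of g(x, y) P[X = x, Y = y] over the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

E_X = expect(lambda x, y: x, joint)
E_Y = expect(lambda x, y: y, joint)
E_sum = expect(lambda x, y: x + y, joint)
E_prod = expect(lambda x, y: x * y, joint)

print("E[X] =", E_X, " E[Y] =", E_Y)
print("E[X + Y] =", E_sum, " E[X] + E[Y] =", E_X + E_Y)
print("E[XY] =", E_prod, " E[X]E[Y] =", E_X * E_Y)
```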


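Example 4.1-7
(variance of Gaussian) Let $X: N(\mu, \sigma^2)$ and consider the zero-mean RV $X - \mu$ with variance $E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - \mu^2 = \mathrm{Var}[X]$ by the linearity of expectation E. We can write

$$E[(X - \mu)^2] = \int_{-\infty}^{+\infty} (x - \mu)^2\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\,dx$$

$$= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} z^2 e^{-z^2/2}\,dz \quad \text{with substitution } z \triangleq (x - \mu)/\sigma.$$

Next we integrate by parts with $u = z$ and $dv = z e^{-z^2/2}\,dz$, yielding $du = dz$ and $v = -e^{-z^2/2}$, so that the above integral becomes

$$\int_{-\infty}^{+\infty} z^2 e^{-z^2/2}\,dz = \left. -z e^{-z^2/2} \right|_{-\infty}^{+\infty} + \int_{-\infty}^{+\infty} e^{-z^2/2}\,dz$$

$$= -0 + 0 + \sqrt{2\pi},$$

where the last term is due to the fact that the standard Normal N(0, 1) density integrates to 1. Thus we have $E[(X - \mu)^2] = \frac{\sigma^2}{\sqrt{2\pi}} \sqrt{2\pi} = \sigma^2$, and thus the parameter $\sigma^2$ in the Gaussian density is shown to be the variance of the RV $X - \mu$, which is the same as the variance of the RV X.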


We have now established that the parameters introduced in Chapter 2, upon definition 
of the Gaussian density, are actually the mean and variance of this distribution. In practice 
these basic parameters are often estimated by making many independent observations on 
X and using Equation 4.1-1 to estimate the mean and Equation 4.1-2 to estimate $\sigma$.
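
The estimation procedure just described can be tried on simulated data. The Python sketch below draws independent Gaussian observations with parameter values that we picked arbitrarily for illustration ($\mu = 3$, $\sigma = 2$), then applies Equation 4.1-1 and Equation 4.1-2 to the observations. The seed is fixed only to make the run repeatable.

```python
import math
import random

random.seed(12345)      # fixed seed so that the run is repeatable

mu_true, sigma_true = 3.0, 2.0   # illustrative parameter values
N = 100_000

# N independent observations on X: N(mu, sigma^2).
samples = [random.gauss(mu_true, sigma_true) for _ in range(N)]

# Equation 4.1-1: set average as the estimate of the mean.
mu_hat = sum(samples) / N

# Equation 4.1-2: set standard deviation as the estimate of sigma.
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in samples) / N)

print(f"estimated mean  = {mu_hat:.4f} (true value {mu_true})")
print(f"estimated sigma = {sigma_hat:.4f} (true value {sigma_true})")
```

With this many observations the estimates typically fall within a few hundredths of the true parameters; how close one can expect them to be is a question taken up in the chapters on statistics.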


Example 4.1-8
(mean of Cauchy) The Cauchy pdf with parameters $\alpha$ $(-\infty < \alpha < \infty)$ and $\beta$ $(\beta > 0)$ is given by

$$f_X(x) = \frac{1}{\pi\beta\left[1 + \left(\frac{x - \alpha}{\beta}\right)^2\right]}, \quad -\infty < x < \infty. \tag{4.1-20}$$

Let X be Cauchy with $\beta = 1$, $\alpha = 0$. Then

$$E[X] = \int_{-\infty}^{\infty} x \left(\frac{1}{\pi(x^2 + 1)}\right) dx$$

is an improper integral and doesn’t converge in the ordinary sense. However, if we evaluate the integral in the Cauchy principal value sense, that is,

$$E[X] = \lim_{x_0 \to \infty} \left[\int_{-x_0}^{x_0} x \left(\frac{1}{\pi(x^2 + 1)}\right) dx\right], \tag{4.1-21}$$

then $E[X] = 0$. Note, however, that with $Y \triangleq X^2$, $E[Y]$ doesn’t exist in any sense because

$$E[Y] = \int_{-\infty}^{\infty} x^2 \left(\frac{1}{\pi(x^2 + 1)}\right) dx = \infty \tag{4.1-22}$$

and thus fails to converge in any sense. Thus, the variance of a Cauchy RV is infinite.
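
The behavior of the two integrals can be observed numerically by truncating them to $[-x_0, x_0]$. The Python sketch below (helper names are ours) uses a midpoint rule for the standard Cauchy pdf with $\alpha = 0$, $\beta = 1$. For comparison it also evaluates the elementary closed form $\frac{2}{\pi}(x_0 - \tan^{-1} x_0)$ of the truncated second-moment integral, which follows from writing $x^2/(x^2 + 1) = 1 - 1/(x^2 + 1)$.

```python
import math

def cauchy_pdf(x):
    """Standard Cauchy pdf (alpha = 0, beta = 1), Equation 4.1-20."""
    return 1.0 / (math.pi * (1.0 + x * x))

def midpoint_integral(f, a, b, n=200_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def truncated_first_moment(x0):
    """Integral of x f_X(x) over the symmetric interval [-x0, x0]."""
    return midpoint_integral(lambda x: x * cauchy_pdf(x), -x0, x0)

def truncated_second_moment(x0):
    """Integral of x^2 f_X(x) over the symmetric interval [-x0, x0]."""
    return midpoint_integral(lambda x: x * x * cauchy_pdf(x), -x0, x0)

def second_moment_closed_form(x0):
    """(2 / pi) (x0 - arctan x0), which grows without bound as x0 increases."""
    return (2.0 / math.pi) * (x0 - math.atan(x0))

for x0 in (10.0, 100.0, 1000.0):
    print(f"x0 = {x0:7.1f}: first moment = {truncated_first_moment(x0):.2e}, "
          f"second moment = {truncated_second_moment(x0):.4f}")
```

The symmetric first-moment integral stays at zero for every $x_0$, in agreement with Equation 4.1-21, while the truncated second moment grows roughly in proportion to $x_0$.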





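Expected value of a function of RVs. For a function of two RVs, that is, $Z = g(X, Y)$, the expected value of Z can be computed from

$$E[Z] = \int_{-\infty}^{\infty} z f_Z(z)\,dz$$

$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\,dx\,dy. \tag{4.1-23}$$

To prove that Equation 4.1-23 can be used to compute $E[Z]$ requires an argument similar to the one we used in establishing Equation 4.1-9. Indeed one would start with an equation very similar to Equation 4.1-10, for example,

$$\{z_j < Z \le z_j + \Delta z_j\} = \bigcup_{k=1}^{N_j} \{(X, Y) \in D_k\},$$

where the $D_k$ are very small disjoint regions containing the points $(x_k^{(j)}, y_k^{(j)})$ such that $g(x_k^{(j)}, y_k^{(j)}) = z_j$. Taking probabilities of both sides and recalling that the $D_k$ are disjoint, yields

$$f_Z(z_j)\,\Delta z_j \simeq \sum_{k=1}^{N_j} f_{XY}(x_k^{(j)}, y_k^{(j)})\,\Delta a_k^{(j)},$$

where $\Delta a_k^{(j)}$ is an infinitesimal area.
Now multiply both sides by $z_j$ and recall that $z_j = g(x_k^{(j)}, y_k^{(j)})$. Then

$$z_j f_Z(z_j)\,\Delta z_j \simeq \sum_{k=1}^{N_j} g(x_k^{(j)}, y_k^{(j)}) f_{XY}(x_k^{(j)}, y_k^{(j)})\,\Delta a_k^{(j)}$$

and, as $j \to \infty$, $\Delta z_j \to 0$, $\Delta a_k^{(j)} \to da = dx\,dy$,

$$\int_{-\infty}^{\infty} z f_Z(z)\,dz = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\,dx\,dy.$$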
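An alternative proof is of interest†. As before let $Z = g(X, Y)$ and write

$$E[Z] = \int_{-\infty}^{\infty} z f_Z(z)\,dz$$

$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} z f_{Z|Y}(z|y) f_Y(y)\,dy\,dz.$$

The second line follows from the definition of a marginal pdf. Now recall that if $Z = g(X)$ then

$$\int_{-\infty}^{\infty} z f_Z(z)\,dz = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$

We can use this result in the present problem as follows. If we hold Y fixed at $Y = y$, then $g(X, y)$ depends only on X, and the conditional expectation of Z with $Y = y$ is

$$\int_{-\infty}^{\infty} z f_{Z|Y}(z|y)\,dz = \int_{-\infty}^{\infty} g(x, y) f_{X|Y}(x|y)\,dx.$$

Using this result in the above yields

$$E[Z] = \int_{-\infty}^{\infty} z f_Z(z)\,dz$$

$$= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} z f_{Z|Y}(z|y)\,dz\right) f_Y(y)\,dy$$

$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{X|Y}(x|y) f_Y(y)\,dx\,dy$$

$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\,dx\,dy.$$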


Example 4.1-9
(mean of product of independent RVs) Let $g(x, y) = xy$. Compute $E[Z]$ if $Z = g(X, Y)$ with X and Y independent and Normal with pdf

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2} \exp\left[-\frac{1}{2\sigma^2}\left((x - \mu_a)^2 + (y - \mu_b)^2\right)\right].$$

† Carl W. Helstrom, Probability and Stochastic Processes for Engineers, 2nd edition. New York, Macmillan, 1991.

Solution Direct substitution into Equation 4.1-23 and recognizing that the resulting double integral factors into the product of two single integrals enables us to write

$$E[Z] = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} x \exp\left[-\frac{1}{2\sigma^2}(x - \mu_a)^2\right] dx \times \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} y \exp\left[-\frac{1}{2\sigma^2}(y - \mu_b)^2\right] dy$$

$$= \mu_a \mu_b.$$
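
The factoring used in this solution has an exact discrete counterpart that is easy to verify. In the Python sketch below, the two PMFs are hypothetical and serve only as an illustration. The joint PMF of independent X and Y is formed as the product of the marginals, and $E[XY]$ is computed from the discrete analog of Equation 4.1-23, that is, as a double sum of $g(x, y) = xy$ weighted by the joint PMF.

```python
from fractions import Fraction

# Hypothetical marginal PMFs (illustration only).
P_X = {1: Fraction(1, 4), 2: Fraction(3, 4)}
P_Y = {-1: Fraction(1, 3), 0: Fraction(1, 3), 4: Fraction(1, 3)}

# Independence: the joint PMF is the product of the marginal PMFs.
joint = {(x, y): px * py for x, px in P_X.items() for y, py in P_Y.items()}

# Discrete analog of Equation 4.1-23 with g(x, y) = x y.
E_XY = sum(x * y * p for (x, y), p in joint.items())

E_X = sum(x * p for x, p in P_X.items())
E_Y = sum(y * p for y, p in P_Y.items())

print("E[XY] =", E_XY)
print("E[X] E[Y] =", E_X * E_Y)
```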


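Equation 4.1-23 can be used to compute $E[X]$ or $E[Y]$. Thus with $Z = g(X, Y) = X$, we obtain

$$E[X] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f_{XY}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} f_{XY}(x, y)\,dy\right] x\,dx. \tag{4.1-24}$$

By Equation 2.6-47, the integral in brackets is the marginal pdf $f_X(x)$. Hence Equation 4.1-23 is completely consistent with the definition

$$E[X] \triangleq \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

With the help of marginal densities we can conclude that

$$E[X + Y] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x + y) f_{XY}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{\infty} x \left(\int_{-\infty}^{\infty} f_{XY}(x, y)\,dy\right) dx + \int_{-\infty}^{\infty} y \left(\int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\right) dy$$

$$= E[X] + E[Y]. \tag{4.1-25}$$

Equation 4.1-25 can be extended to N random variables $X_1, X_2, \ldots, X_N$. Thus

$$E\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} E[X_i]. \tag{4.1-26}$$

Note that independence is not required.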


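Example 4.1-10
(independent Normal RVs) Let X, Y be jointly normal, independent RVs with pdf

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left(-\frac{1}{2}\left[\left(\frac{x - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - \mu_2}{\sigma_2}\right)^2\right]\right).$$

It is clear that X and Y are independent since $f_{XY}(x, y) = f_X(x) f_Y(y)$. The marginal pdf’s are obtained using Equations 2.6-44 and 2.6-47:

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_1}{\sigma_1}\right)^2\right]$$

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left[-\frac{1}{2}\left(\frac{y - \mu_2}{\sigma_2}\right)^2\right].$$

Thus Equation 4.1-25 yields

$$E[X + Y] = \mu_1 + \mu_2.$$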





Example 4.1-11 
(Chi-square law) In a number of problems in engineering and science, signals add inco- 
herently, meaning that the power in the sum of the signals is merely the sum of the 
powers. This occurs, for example, in optics when a surface is illuminated by light sources 
of different wavelengths. Then the power measured on the surface is just the sum of 
the powers contributed by each of the sources. In electric circuits, when the sources are 
sinusoidal at different frequencies, the power dissipated in any resistor is the sum of the 
powers contributed by each of the sources. Suppose the individual source signals, at a 
given instant of time, are modeled as identically distributed Normal RVs. In particular 
let X1, X2,..., Xn represent the n independent signals produced by the n sources with 
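$X_i: N(0, 1)$ for $i = 1, 2, \ldots, n$ and let $Y_i = X_i^2$. We know from Example 3.2-2 in Chapter 3 that the pdf of $Y_i$ is given by

$$f_{Y_i}(y) = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2} u(y).$$

Consider now the sums $Z_2 = Y_1 + Y_2$, $Z_3 = Y_1 + Y_2 + Y_3$, ..., $Z_n = \sum_{i=1}^{n} Y_i$. The pdf of $Z_2$ is easily computed by convolution as

$$f_{Z_2}(z) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi x}}\, e^{-x/2} u(x) \times \frac{1}{\sqrt{2\pi(z - x)}}\, e^{-(z - x)/2} u(z - x)\,dx$$

$$= \frac{1}{\pi}\, e^{-z/2} \int_{0}^{\sqrt{z}} \frac{dy}{\sqrt{z - y^2}}\; u(z)$$

$$= \frac{1}{2}\, e^{-z/2} u(z) \quad \text{(exponential pdf).}$$

To get from line 1 to line 2 we let $x = y^2$. To get from line 2 to line 3, we used that the integral is an elementary trigonometric function integral in disguise. To get the pdf of $Z_3$ we convolve the pdf of $Z_2$ with that of $Y_3$. The result is

$$f_{Z_3}(z) = \frac{1}{2} \int_{-\infty}^{\infty} e^{-x/2} u(x) \times \frac{1}{\sqrt{2\pi(z - x)}}\, e^{-(z - x)/2} u(z - x)\,dx$$

$$= \frac{1}{\sqrt{2\pi}}\, z^{1/2} e^{-z/2} u(z).$$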
Figure 4.1-2 The Chi-square pdf for three large values for the parameter n: n = 30 (solid); n = 40 (dashed); n = 50 (stars). For large values of n, the Chi-square pdf can be approximated by a normal N(n, 2n) for computing probabilities not too far from the mean. For example, for n = 30, $P[\mu - \sigma < X < \mu + \sigma] = 0.6827$ assuming X: N(30, 60). The value computed, using single-precision arithmetic, using the Chi-square pdf, yields 0.6892.


We leave the intermediate steps, which involve only elementary transformations, to the reader. Proceeding in this way, or using mathematical induction, we find that

$$f_{Z_n}(z) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, z^{(n-2)/2} e^{-z/2} u(z).$$

This pdf was introduced in Chapter 2 as the Chi-square pdf. More precisely, it is known as the Chi-square distribution with n degrees-of-freedom. For n > 2, the pdf has value zero at z = 0, reaches a peak, and then exhibits monotonically decreasing tails. For large values of n, it resembles a Gaussian pdf with mean in the vicinity of n. However, the Chi-square can never be truly Gaussian because the Chi-square RV never takes on negative values. The character of the Chi-square pdf is shown in Figure 4.1-2 for different values of large n.

The mean and variance of the Chi-square RV are readily computed from the definition $Z_n \triangleq \sum_{i=1}^{n} X_i^2$. Thus $E[Z_n] = E[\sum_{i=1}^{n} X_i^2] = \sum_{i=1}^{n} E[X_i^2] = n$. Also $\mathrm{Var}(Z_n) = E[(Z_n - n)^2]$. After simplifying, we obtain $\mathrm{Var}(Z_n) = E[Z_n^2] - n^2$. We leave it to the reader to show that $E[Z_n^2] = 2n + n^2$ and, hence, that $\mathrm{Var}(Z_n) = 2n$.
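
The mean and variance just quoted can be checked against the Chi-square pdf by numerical integration. The Python sketch below (function names, the upper integration limit, and the number of steps are our choices) integrates the pdf with n degrees of freedom by the midpoint rule and returns the total probability, the mean, and the variance.

```python
import math

def chi_square_pdf(z, n):
    """Chi-square pdf with n degrees of freedom (zero for z <= 0)."""
    if z <= 0.0:
        return 0.0
    return z ** ((n - 2) / 2) * math.exp(-z / 2) / (2 ** (n / 2) * math.gamma(n / 2))

def chi_square_moments(n, upper=200.0, steps=200_000):
    """Midpoint-rule estimates of (total probability, mean, variance) on [0, upper]."""
    h = upper / steps
    m0 = m1 = m2 = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        w = chi_square_pdf(z, n) * h   # probability mass of one small cell
        m0 += w
        m1 += z * w
        m2 += z * z * w
    return m0, m1, m2 - m1 * m1

for n in (3, 30):
    total, mean, var = chi_square_moments(n)
    print(f"n = {n}: total = {total:.5f}, mean = {mean:.4f}, variance = {var:.4f}")
```

For both values of n the computed mean is close to n and the computed variance is close to 2n.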


Example 4.1-12 
At the famous University of Politicalcorrectness (U of P), the administration requires that 
each professor be equipped with an electronic Rolodex which contains the names of every 
student in the class. When the professor wishes to call on a student, she merely hits the 
“call” button on the Rolodex, and a student’s name is selected randomly by an electronic 
circuit inside the Rolodex. By using this device the professor becomes immune to charges 
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of bias in the selection of students she calls on to answer her questions. Find an expression 
for the average number of “calls” r required so each student is called upon at least once. 


Solution The use of the electronic Rolodex implies that some students may not be called
at all during the entire semester and other students may be called twice or three times
in a row. It will depend on how big the class is. Nevertheless the average is well defined
because extremely long bad runs, that is, where one or more students are not called on,
are very rare. The careful reader may have observed that this is an occupancy problem if
we associate “calls” with balls and students with cells. Let $R \in \{n, n+1, n+2, \ldots\}$ denote the
number of balls needed to fill all the $n$ cells for the first time. The only way that this can
happen is that the first $R-1$ balls fill all but one of the $n$ cells (event $E_1$) and the $R$th ball
fills the remaining empty cell (event $E_2$). Translated to the class situation, this means that
after $R-1$ calls, all but one student will have been called (event $E_1$) and this student will
be called on the $R$th call (event $E_2$). Thus $P[R = r, n] \triangleq P_R(r,n) = P[E_1 E_2] = P[E_1]P[E_2]$
since $E_1$ and $E_2$ are independent. Now $P[E_2] = 1/n$ since it is merely the probability that
a given ball goes into a selected cell, and $P[E_1]$ is $P_1(r-1, n)$ of Equation 1.8-13, that is

\[
P_1(r-1,n) = \begin{cases} n \displaystyle\sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i \left(1 - \frac{i+1}{n}\right)^{r-1}, & r \ge n, \\ 0, & \text{else.} \end{cases}
\]

Thus $P_R(r,n)$ is given by

\[
P_R(r,n) = \begin{cases} \displaystyle\sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i \left(1 - \frac{i+1}{n}\right)^{r-1}, & r \ge n, \\ 0, & \text{else.} \end{cases} \tag{4.1-27}
\]

The probability $P_K(k,n)$ that all $n$ cells (students) have been filled (called) after distributing
$k$ balls (called $k$ students) is, from Equation 1.8-9,

\[
P_K(k,n) = \begin{cases} \displaystyle\sum_{i=0}^{n} \binom{n}{i} (-1)^i \left(1 - \frac{i}{n}\right)^{k}, & k \ge n, \\ 0, & \text{else.} \end{cases} \tag{4.1-28}
\]

Finally, the expected value of the RV $R$ is given by

\[
E[R] = \sum_{r=n}^{\infty} r \sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i \left(1 - \frac{i+1}{n}\right)^{r-1}. \tag{4.1-29}
\]
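As a numerical sanity check on Equations 4.1-27 and 4.1-29, the following Python sketch evaluates the PMF of R with exact rational arithmetic and truncates the infinite sum over r at an assumed cutoff `r_max`. The function names and the cutoff are illustrative choices, not part of the text. The comparison value n(1 + 1/2 + ... + 1/n) is the well-known closed form for this waiting-time problem; the text itself does not state it.

```python
from fractions import Fraction
from math import comb


def pmf_R(r, n):
    """P_R(r, n) of Equation 4.1-27, computed exactly with rationals."""
    if r < n:
        return Fraction(0)
    return sum(
        comb(n - 1, i) * (-1) ** i * Fraction(n - i - 1, n) ** (r - 1)
        for i in range(n)
    )


def expected_calls(n, r_max=400):
    """Equation 4.1-29 with the sum over r truncated at r_max (assumed cutoff)."""
    return float(sum(r * pmf_R(r, n) for r in range(n, r_max + 1)))


def harmonic_form(n):
    """Closed form n * (1 + 1/2 + ... + 1/n), used only for comparison."""
    return n * sum(1.0 / k for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, expected_calls(n), harmonic_form(n))
```

Exact fractions are used because the alternating sum in Equation 4.1-27 suffers from cancellation in floating point when n is large; for class sizes near 50 a floating-point evaluation should be treated with some caution.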





Example 4.1-13 
Write a MATLAB program for computing the probability that all the students in Example
4.1-12 are called upon at least once in r calls from the electronic Rolodex. Assume there
are 20 students in the class.

Solution The appropriate equation to be coded is Equation 4.1-28. The result is shown
in Figure 4.1-3.

Figure 4.1-3 MATLAB result for Example 4.1-13.


function [tries,prob]=occupancy(balls,cells)
tries=1:balls;            % vector of the number of tries (calls)
prob=zeros(1,balls);      % probability that all cells are filled after m tries
a=zeros(1,cells);         % signed binomial coefficients
d=zeros(1,cells);         % powers (1-k/cells)^m
term=zeros(1,cells);      % products a(k)*d(k)
% next follows the realization of Equation (4.1-28)
for m=1:balls
    for k=1:cells
        a(k)=((-1)^k)*prod(1:cells)/(prod(1:k)*prod(1:cells-k));
        d(k)=(1-(k/cells))^m;
        term(k)=a(k)*d(k);
    end
    prob(m)=1+sum(term);  % the leading 1 is the k=0 term of the sum
end
plot(tries,prob)
title(['Probability of all ' num2str(cells) ' students in the class being called in r tries'])
xlabel('number of tries')
ylabel(['Probability of all ' num2str(cells) ' students being called'])
Example 4.1-14
Write a MATLAB program for computing the average number of calls required for each
student to be called at least once. Assume a maximum of 50 students and make sure the
number of calls is large (n > 400).

Solution The appropriate equation to be coded is Equation 4.1-29. The result is shown
in Figure 4.1-4.

Figure 4.1-4 MATLAB result for Example 4.1-14.


function [cellvec,avevec]=avertries(ballimit,cellimit)
cellvec = 1:cellimit;              % class sizes n = 1,...,cellimit
termvec = zeros(1,ballimit);
avevec = zeros(1,cellimit);        % average number of tries for each class size
brterm = zeros(1,ballimit);        % r times the sum over i = 1,...,n-1
lrterm = zeros(1,ballimit);        % r times the i = 0 term
for n=1:cellimit
    a = zeros(1,n);
    d = zeros(1,n);
    termvec = zeros(1,n);
    for r=1:ballimit
        for i=1:n-1
            a(i) = ((-1)^i)*prod(1:n-1)/(prod(1:i)*prod(1:n-1-i));
            d(i) = (1-((i+1)/n))^(r-1);
            termvec(i) = a(i)*d(i);
        end
        brterm(r) = r*sum(termvec);
        lrterm(r) = r*((1-(1/n)))^(r-1);
    end
    avevec(n) = sum(brterm)+sum(lrterm);
end
plot(cellvec,avevec,'o')
title('Average number of Rolodex tries to call all students at least once')
xlabel('number of students in the class')
ylabel('Expected number of Rolodex tries to reach all students at least once')
grid
Example 4.1-15
(geometric distribution) The RV X is said to have a geometric distribution if its probability
mass function is given by

\[
P_X(n) = (1-a)a^n u(n),
\]

where u is the unit-step function† and $0 < a < 1$. Clearly $\sum_{n=0}^{\infty} P_X(n) = 1$, a result easily
obtained from $\sum_{n=0}^{\infty} a^n = (1-a)^{-1}$ for $0 < a < 1$. The expected value is found from

\[
E[X] \triangleq \mu = (1-a)\sum_{n=0}^{\infty} n a^n = (1-a) \times a \times \frac{d}{da}(1-a)^{-1} = \frac{a}{1-a}.
\]

Solving for a, we obtain

\[
a = \frac{\mu}{1+\mu}.
\]

Thus, we can rewrite the geometric PMF as

\[
P_X(n) = \frac{1}{1+\mu}\left(\frac{\mu}{1+\mu}\right)^n u(n).
\]

Note: There is another common definition of a geometric RV where the PMF support is
$[1, \infty)$ instead of $[0, \infty)$. The corresponding geometric law appeared early in Example 1.9-4.
Its PMF would take the form $P_X(n) = (1-a)a^{n-1}u(n-1)$, that is, the same sequence of
numbers shifted right one place.
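The normalization, the mean $a/(1-a)$, and the inversion $a = \mu/(1+\mu)$ can be checked numerically. The sketch below truncates the infinite sums at an assumed cutoff `n_max`; the function names and the value a = 0.6 are arbitrary illustrations.

```python
def geometric_pmf(n, a):
    """P_X(n) = (1 - a) a^n for n = 0, 1, 2, ... (support starting at 0)."""
    return (1.0 - a) * a ** n if n >= 0 else 0.0


def truncated_sums(a, n_max=2000):
    """Return (total probability, mean), both truncated at n_max (assumed cutoff)."""
    total = sum(geometric_pmf(n, a) for n in range(n_max + 1))
    mean = sum(n * geometric_pmf(n, a) for n in range(n_max + 1))
    return total, mean


if __name__ == "__main__":
    a = 0.6
    total, mean = truncated_sums(a)
    mu = a / (1.0 - a)
    # mu / (1 + mu) should recover the original parameter a
    print(total, mean, mu, mu / (1.0 + mu))
```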





4.2 CONDITIONAL EXPECTATIONS 


In many practical situations we want to know the average of a subset of the population: 
the average of the passing grades of an exam; the average lifespan of people who are still 
alive at age 70; the average height of fighter pilots (many air forces have both an upper and 
lower limit on the acceptable height of a pilot); the average blood pressure of long-distance 
runners, and so forth. Problems of this type fall within the realm of conditional expectations. 

In conditional expectations we compute the average of a subset of a population that 
shares some property due to the outcome of an event. For example in the case of the average 
of passing grades, the subset is those exams that received passing grades. What all these 
exams share is that their grade is, say, ≥65. The event that has occurred is that they
received passing grades. 


Definition 4.2-1 The conditional expectation of X given that the event B has 
occurred is 


\[
E[X|B] \triangleq \int_{-\infty}^{\infty} x f_{X|B}(x|B)\, dx. \tag{4.2-1}
\]


†That is, $u(n) = 1$ for $n \ge 0$ and $u(n) = 0$, else.

If X is discrete, then Equation 4.2-1 can be replaced with

\[
E[X|B] = \sum_i x_i P_{X|B}(x_i|B). \qquad \blacksquare \tag{4.2-2}
\]


To give the reader a feel for the notion of conditional expectation, consider the following 
exam scores in a course on probability theory: 28, 35, 44, 66, 68, 75, 77, 80, 85, 87, 90, 
100, 100. Assume that the passing grade is 65. Then the average score is 71.9; however, the 
average passing score is 82.8. A closely related example is worked out as follows. 


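Example 4.2-1
(conditional expectation of uniform distribution) Consider a continuous RV X and the event
$B \triangleq \{X \ge a\}$. From Equations 2.6-1 and 2.6-2 and a little bit of work, we obtain

\[
F_{X|B}(x|X \ge a) = \begin{cases} 0, & x < a, \\ \dfrac{F_X(x) - F_X(a)}{1 - F_X(a)}, & x \ge a. \end{cases} \tag{4.2-3}
\]

Hence

\[
f_{X|B}(x|X \ge a) = \begin{cases} 0, & x < a, \\ \dfrac{f_X(x)}{1 - F_X(a)}, & x \ge a, \end{cases} \tag{4.2-4}
\]

and

\[
E[X|X \ge a] = \frac{\int_a^{\infty} x f_X(x)\, dx}{\int_a^{\infty} f_X(x)\, dx}. \tag{4.2-5}
\]

Assume that X is a uniform RV in [0, 100]. Then

\[
E[X] = \frac{1}{100}\int_0^{100} x\, dx = 50,
\]

but using Equation 4.2-5 with a = 65

\[
E[X|X \ge 65] = 82.5.
\]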
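Both the exam-score averages quoted before this example and the uniform result can be reproduced in a few lines. In the Python sketch below the helper names are ours, and the integral ratio of Equation 4.2-5 is approximated with a midpoint Riemann sum whose step count is an arbitrary choice.

```python
scores = [28, 35, 44, 66, 68, 75, 77, 80, 85, 87, 90, 100, 100]


def mean(values):
    return sum(values) / len(values)


def conditional_mean(values, threshold):
    """Average of only those values that satisfy the event {value >= threshold}."""
    kept = [v for v in values if v >= threshold]
    return mean(kept)


def uniform_conditional_mean(lo, hi, a, steps=100_000):
    """Equation 4.2-5 for X uniform on [lo, hi], using a midpoint Riemann sum."""
    a = max(a, lo)
    dx = (hi - a) / steps
    density = 1.0 / (hi - lo)
    num = den = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * dx
        num += x * density * dx   # numerator: integral of x f_X(x)
        den += density * dx       # denominator: integral of f_X(x)
    return num / den


if __name__ == "__main__":
    print(round(mean(scores), 1), conditional_mean(scores, 65))
    print(uniform_conditional_mean(0.0, 100.0, 65.0))
```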





Conditional expectations often occur when dealing with RVs that are related in some way.
For example, let Y denote the lifetime of a person chosen at random, and let X be a binary
RV that denotes whether the person smokes or not, that is, X = 0 if a nonsmoker, X = 1 if a
smoker. Then clearly E[Y|X = 0] is expected to be larger† than E[Y|X = 1]. Or let X be the
intensity of the incident illumination and let Y be the instantaneous photocurrent generated
by a photodetector. Typically the expected value of Y will be larger for stronger illumination
and smaller for weaker illumination. We define some important concepts as follows.

†Statistical evidence indicates that each cigarette smoked reduces longevity by about eight minutes.
Hence smoking one pack a day for a whole year reduces the expected longevity of the smoker by 40 days!


Definition 4.2-2 Let X and Y be discrete RVs with joint PMF $P_{X,Y}(x_i, y_j)$. Then
the conditional expectation of Y given $X = x_i$, denoted by $E[Y|X = x_i]$, is

\[
E[Y|X = x_i] \triangleq \sum_j y_j P_{Y|X}(y_j|x_i). \tag{4.2-6}
\]

Here $P_{Y|X}(y_j|x_i)$ is the conditional probability that $\{Y = y_j\}$ occurs given that $\{X = x_i\}$
has occurred and is given by $P_{X,Y}(x_i, y_j)/P_X(x_i)$. ■


We can derive an interesting and useful formula for E[Y] in terms of the conditional
expectation of Y given X = x. The reasoning is much the same as that which we used in
computing the average or total probability of an event in terms of its conditional probabilities
(see Equation 1.6-7 or 2.6-4). Thus,

\begin{align}
E[Y] &= \sum_j y_j P_Y(y_j) \tag{4.2-7} \\
     &= \sum_j y_j \sum_i P_{X,Y}(x_i, y_j) \notag \\
     &= \sum_i \Big[ \sum_j y_j P_{Y|X}(y_j|x_i) \Big] P_X(x_i) \notag \\
     &= \sum_i E[Y|X = x_i]\, P_X(x_i). \tag{4.2-8}
\end{align}


Equation 4.2-8 is a very neat result and says that we can compute E[Y] by averaging the
conditional expectation of Y given X with respect to X.† Thus, in the smoking-longevity
example discussed earlier, suppose E[Y|X = 0] = 79.2 years and E[Y|X = 1] = 69.4 years
and $P_X(0) = 0.75$ and $P_X(1) = 0.25$. Then

\[
E[Y] = 79.2 \times 0.75 + 69.4 \times 0.25 = 76.75
\]


is the expected lifetime of the general population. 
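To see Equations 4.2-6 and 4.2-8 work together, one can start from a joint PMF, form each conditional mean, and confirm that averaging them over X reproduces the directly computed E[Y]. The joint PMF in the following sketch is hypothetical and chosen only for illustration; the function names are ours as well.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P_{X,Y}(x, y), keyed by (x, y); illustration only.
joint = {
    (0, 1): F(1, 8), (0, 2): F(3, 8),
    (1, 1): F(1, 4), (1, 2): F(1, 4),
}


def marginal_x(joint):
    """P_X(x) obtained by summing the joint PMF over y."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0) + p
    return px


def cond_mean_y(joint, x):
    """Equation 4.2-6: sum_j y_j P_{X,Y}(x, y_j) / P_X(x)."""
    px = marginal_x(joint)[x]
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / px


def mean_y_direct(joint):
    """E[Y] computed directly from the joint PMF."""
    return sum(y * p for (_, y), p in joint.items())


def mean_y_by_conditioning(joint):
    """Equation 4.2-8: average the conditional means over X."""
    px = marginal_x(joint)
    return sum(cond_mean_y(joint, x) * px[x] for x in px)


if __name__ == "__main__":
    print(cond_mean_y(joint, 0), cond_mean_y(joint, 1))
    print(mean_y_direct(joint), mean_y_by_conditioning(joint))
```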
A result similar to Equation 4.2-8 holds for the continuous case as well. It is derived 
using Equation 2.6-85 from Chapter 2, that is, 


\[
f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}, \qquad f_X(x) \ne 0. \tag{4.2-9}
\]


†Notice that this statement implies that the conditional expectation of Y given X is an RV. We shall
elaborate on this important concept shortly. For the moment we assume that X assumes the fixed value $x_i$
(or $x$).

The definition of conditional expectation for a continuous RV follows.

Definition 4.2-3 Let X and Y be continuous RVs with joint pdf $f_{XY}(x, y)$. Let the
conditional pdf of Y given that X = x be denoted as in Equation 4.2-9. Then the conditional
expectation of Y given that X = x is given by

\[
E[Y|X = x] \triangleq \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\, dy. \qquad \blacksquare \tag{4.2-10}
\]


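Since

\[
E[Y] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f_{XY}(x, y)\, dx\, dy, \tag{4.2-11}
\]

it follows from Equations 4.2-9 and 4.2-10 that

\begin{align}
E[Y] &= \int_{-\infty}^{\infty} f_X(x) \left[ \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\, dy \right] dx \notag \\
     &= \int_{-\infty}^{\infty} E[Y|X = x]\, f_X(x)\, dx. \tag{4.2-12}
\end{align}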


Equation 4.2-12 is the continuous RV equivalent of Equation 4.2-8. It can be used to good 
advantage (over the direct method) for computing E[Y]. We illustrate this point with an 
example from optical communications. 


Example 4.2-2
(conditional Poisson) In the photoelectric detector shown in Figure 4.2-1, the number of
photoelectrons Y produced in time τ depends on the (normalized) incident energy X. If X
were constant, say X = x, Y would be a Poisson RV [4-4] with parameter x, but as real light
sources, except for gain-stabilized lasers, do not emit constant energy signals, X must be
treated as an RV. In certain situations the pdf of X is accurately modeled by

\[
f_X(x) = \begin{cases} \dfrac{1}{\mu_X} \exp\left(-\dfrac{x}{\mu_X}\right), & x \ge 0, \\ 0, & x < 0, \end{cases} \tag{4.2-13}
\]

where $\mu_X$ is a parameter that equals E[X]. We shall now compute E[Y] using Equation 4.2-12
and using the direct method.

Figure 4.2-1 In a photoelectric detector, incident illumination generates a current consisting of
photogenerated electrons. (Diagram labels: incident light, photodetector, output current i(t), current
pulse due to a single photoelectron.)


Solution Since for X = x, Y is Poisson, we can write

\[
P[Y = k|X = x] = \frac{x^k}{k!} e^{-x}, \qquad k = 0, 1, 2, \ldots
\]

and, from Example 4.1-6,

\[
E[Y|X = x] = x.
\]

Finally, using Equation 4.2-12 with the appropriate substitutions, that is,

\[
E[Y] = \int_0^{\infty} x \left[ \frac{1}{\mu_X} \exp\left(-\frac{x}{\mu_X}\right) \right] dx,
\]

we obtain, by integration by parts,

\[
E[Y] = \mu_X.
\]

In contrast to the simplicity with which we obtained this result, consider the direct approach,
that is,

\[
E[Y] = \sum_{k=0}^{\infty} k P_Y(k). \tag{4.2-14}
\]

To compute $P_Y(k)$ we use the Poisson transform (Equation 2.6-14) with $f_X(x)$, as given by
Equation 4.2-13. This furnishes (see Equation 2.6-23)

\[
P_Y(k) = \frac{\mu_X^k}{(1+\mu_X)^{k+1}}. \tag{4.2-15}
\]

Finally, using Equation 4.2-15 in 4.2-14 yields

\[
E[Y] = \sum_{k=0}^{\infty} k \frac{\mu_X^k}{(1+\mu_X)^{k+1}}.
\]

It is known that this series sums to $\mu_X$. Alternatively one can evaluate the sum indirectly
using some clever tricks involving derivatives.
Example 4.2-3
(conditional Gaussian density) Let X and Y be two zero-mean RVs with joint density

\[
f_{XY}(x, y) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left( -\frac{x^2 + y^2 - 2\rho x y}{2\sigma^2(1-\rho^2)} \right), \qquad |\rho| < 1. \tag{4.2-16}
\]

We shall soon find out (Section 4.3) that the pdf in Equation 4.2-16 is a special case of
the general joint Gaussian law for two RVs. First we see that when $\rho \ne 0$, $f_{XY}(x,y) \ne
f_X(x)f_Y(y)$; hence X and Y are not independent when $\rho \ne 0$. When $\rho = 0$, we can indeed
write $f_{XY}(x,y) = f_X(x)f_Y(y)$ so that $\rho = 0$ implies independence. For the present, however,
our unfamiliarity with the meaning of $\rho$ ($\rho$ is called the normalized covariance or correlation
coefficient) is not important. When $\rho$ is zero, X and Y are zero-mean Gaussian RVs, that is,

\[
f_X(x) = f_Y(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/(2\sigma^2)}.
\]

However, the conditional expectation of Y given X = x is not zero even though Y is a
zero-mean RV! In fact from Equation 4.2-9,

\[
f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma^2(1-\rho^2)}} \exp\left( -\frac{(y - \rho x)^2}{2\sigma^2(1-\rho^2)} \right). \tag{4.2-17}
\]

Hence $f_{Y|X}(y|x)$ is Gaussian with mean $\rho x$. Thus,

\begin{align}
E[Y|X = x] &= \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\, dy \notag \\
           &= \rho x. \tag{4.2-18}
\end{align}

When $\rho$ is close to unity, $E[Y|X = x] \approx x$, which implies that Y tracks X quite closely
(exactly if $\rho = 1$), and if we wish to predict Y, say, with $Y_p$ upon observing X = x,
a good bet is to choose our predicted value $Y_p = x$. On the other hand, when $\rho = 0$,
observing X doesn’t help us to predict Y. Thus, we see that in the Gaussian case at least
and somewhat more generally, $\rho$ is related to the predictability of one RV from observing
another. A cautionary note should be sounded, however: The fact that one RV doesn’t help
us to linearly predict another doesn’t generally mean that the two RVs are independent.
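The result $E[Y|X = x] = \rho x$ can also be confirmed without completing the square, by integrating the joint density of Example 4.2-3 numerically. In the sketch below the joint pdf is written with the cross term $2\rho xy$; the integration window of twelve standard deviations, the step count, and the test values are arbitrary assumptions.

```python
from math import exp, pi, sqrt


def joint_pdf(x, y, sigma, rho):
    """Zero-mean, equal-variance joint Gaussian pdf with cross term 2*rho*x*y."""
    norm = 2.0 * pi * sigma ** 2 * sqrt(1.0 - rho ** 2)
    q = (x * x + y * y - 2.0 * rho * x * y) / (2.0 * sigma ** 2 * (1.0 - rho ** 2))
    return exp(-q) / norm


def cond_mean_numeric(x, sigma, rho, half_width=12.0, steps=20_000):
    """E[Y | X = x] as a ratio of midpoint-rule integrals over y."""
    lo, hi = -half_width * sigma, half_width * sigma
    dy = (hi - lo) / steps
    num = den = 0.0
    for j in range(steps):
        y = lo + (j + 0.5) * dy
        f = joint_pdf(x, y, sigma, rho)
        num += y * f * dy   # integral of y f_XY(x, y) dy
        den += f * dy       # integral of f_XY(x, y) dy, i.e., f_X(x)
    return num / den


if __name__ == "__main__":
    x, sigma, rho = 1.5, 2.0, 0.8
    print(cond_mean_numeric(x, sigma, rho), rho * x)
```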


Example 4.2-4
(expectation conditioned on sums of RVs) Consider the two independent, discrete RVs $K_1$
and $K_2$. We wish to compute $E[K_1|K_1 + K_2 = m]$. It is first necessary to determine the
conditional probability $P[K_1 = k_1|K_1 + K_2 = m]$. This conditional probability can be
written as

\begin{align}
P[K_1 = k_1|K_1 + K_2 = m] &= \frac{P[K_1 = k_1, K_1 + K_2 = m]}{P[K_1 + K_2 = m]} \notag \\
&= \frac{P[K_1 = k_1, K_2 = m - k_1]}{P[K_1 + K_2 = m]} \notag \\
&= \frac{P[K_1 = k_1]\,P[K_2 = m - k_1]}{P[K_1 + K_2 = m]}. \tag{4.2-19}
\end{align}

Let $K_1$ and $K_2$ each be distributed as Poisson, with parameters $\theta_1$ and $\theta_2$, respectively.
(If $\theta_1 = \theta_2$, these RVs are independent and identically distributed, and we designate them
as i.i.d. RVs.) Then, from Equation 4.2-19 and Example 3.3-8 we get

\begin{align}
P[K_1 = k_1|K_1 + K_2 = m] &= \frac{\left(e^{-\theta_1}\theta_1^{k_1}/k_1!\right) \times \left(e^{-\theta_2}\theta_2^{m-k_1}/(m-k_1)!\right)}{e^{-(\theta_1+\theta_2)}(\theta_1+\theta_2)^m/m!} \tag{4.2-20} \\
&= \binom{m}{k_1} \theta_1^{k_1}\theta_2^{m-k_1} \times (\theta_1 + \theta_2)^{-m}. \notag
\end{align}

Now recall that $E[K_1|K_1 + K_2 = m] \triangleq \sum_{k_1=0}^{m} k_1 P[K_1 = k_1|K_1 + K_2 = m]$ and the binomial
expansion formula is given by $\sum_{k=0}^{n} \binom{n}{k} \theta_1^k \theta_2^{n-k} = (\theta_1 + \theta_2)^n$. Then using Equation 4.2-20
finally yields

\[
E[K_1|K_1 + K_2 = m] = m \times \left( \frac{\theta_1}{\theta_1 + \theta_2} \right). \tag{4.2-21}
\]
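A short numerical check of this example is sketched below in Python. It builds the conditional PMF from the last line of Equation 4.2-19, using the fact cited from Example 3.3-8 that the sum of two independent Poisson RVs is Poisson with parameter $\theta_1 + \theta_2$, and compares the resulting conditional mean with $m\theta_1/(\theta_1+\theta_2)$. The parameter values and function names are arbitrary illustrations.

```python
from math import comb, exp, factorial


def poisson_pmf(k, theta):
    return exp(-theta) * theta ** k / factorial(k)


def cond_pmf(k1, m, theta1, theta2):
    """P[K1 = k1 | K1 + K2 = m] for independent Poisson K1, K2 (Equation 4.2-19)."""
    if not 0 <= k1 <= m:
        return 0.0
    return (poisson_pmf(k1, theta1) * poisson_pmf(m - k1, theta2)
            / poisson_pmf(m, theta1 + theta2))


def cond_mean(m, theta1, theta2):
    """E[K1 | K1 + K2 = m] by direct summation over k1 = 0, ..., m."""
    return sum(k1 * cond_pmf(k1, m, theta1, theta2) for k1 in range(m + 1))


if __name__ == "__main__":
    m, t1, t2 = 10, 2.0, 3.0
    p = t1 / (t1 + t2)
    # The conditional PMF should match a binomial with parameters m and p.
    print(cond_pmf(3, m, t1, t2), comb(m, 3) * p ** 3 * (1 - p) ** (m - 3))
    print(cond_mean(m, t1, t2), m * p)
```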


Example 4.2-5
(continuation of Example 4.2-4) Let $K_1$, $K_2$, $K_3$ denote multinomial RVs for $l = 3$, that is,
a three-nomial (three outcomes possible). Then for n trials, we have the PMF†

\begin{align}
P_{\mathbf{K}}(k_1, k_2, k_3) &= P[K_1 = k_1, K_2 = k_2, K_3 = k_3] \notag \\
&= \frac{n!}{k_1!\,k_2!\,k_3!}\, p_1^{k_1} p_2^{k_2} p_3^{k_3}, \qquad k_1 + k_2 + k_3 = n, \ \text{all } k_i \ge 0, \tag{4.2-22}
\end{align}

where $p_1 + p_2 + p_3 = 1$. We wish to compute $E[K_1|K_1 + K_2 = m]$.

Solution As in the previous example, we need to compute $P[K_1 = k_1|K_1 + K_2 = m]$. We
write

\[
P[K_1 = k_1|K_1 + K_2 = m] = \frac{P[K_1 = k_1, K_1 + K_2 = m]}{P[K_1 + K_2 = m]}.
\]

Note that for the multinomial, the event $\{\zeta: K_1(\zeta) + K_2(\zeta) = m\} \cap \{\zeta: K_1(\zeta) = k_1\}$ is
identical to the event $\{\zeta: K_1(\zeta) = k_1, K_2(\zeta) = m - k_1, K_3(\zeta) = n - m\}$. Hence

\begin{align}
P[K_1 = k_1|K_1 + K_2 = m] &= \frac{P[K_1 = k_1, K_2 = m - k_1, K_3 = n - m]}{P[K_3 = n - m]} \tag{4.2-23} \\
&= \frac{n!}{k_1!\,(m-k_1)!\,(n-m)!}\, p_1^{k_1} p_2^{m-k_1} p_3^{n-m} \div \frac{n!}{m!\,(n-m)!}\, (1-p_3)^m p_3^{n-m} \notag \\
&= \binom{m}{k_1} p_1^{k_1} p_2^{m-k_1} (p_1 + p_2)^{-m}. \tag{4.2-24}
\end{align}

Finally, using

\[
E[K_1|K_1 + K_2 = m] = \sum_{k_1=0}^{m} k_1 P[K_1 = k_1|K_1 + K_2 = m],
\]

we obtain that

\[
E[K_1|K_1 + K_2 = m] = m \frac{p_1}{p_1 + p_2}. \tag{4.2-25}
\]

We leave it to the reader to compute that

\[
E[K_2|K_1 + K_2 = m] = m \frac{p_2}{p_1 + p_2}. \tag{4.2-26}
\]

These kinds of problems occur in the estimation procedure known as the expectation-maximization
algorithm, discussed in detail in Chapter 11.

†Note that the notation differs from that in the binomial case. Using this new multinomial notation for
the binomial case, we would have, for a binomial RV K: $K_1 = K$ and $K_2 = n - K$. In the general l-nomial
distribution we must always abide by the constraint that $K_1 + K_2 + \ldots + K_l = n$.





Conditional Expectation as a Random Variable 


Consider, for the sake of being specific, a function Y = g(X) of a discrete RV X. Then its
expected value is

\begin{align}
E[Y] &= \sum_i g(x_i) P_X(x_i) \notag \\
     &= E[g(X)]. \notag
\end{align}

This suggests that we could write Equation 4.2-8 in similar notation, that is,

\begin{align}
E[Y] &= \sum_i E[Y|X = x_i]\, P_X(x_i) \notag \\
     &= E[E[Y|X]]. \tag{4.2-27}
\end{align}

It is important to note that the object $E[Y|X = x_i]$ is a number, as is $g(x_i)$, but the
object $E[Y|X]$ is a function of the RV X and therefore is itself an RV. Given a probability
space $\mathcal{P} = (\Omega, \mathcal{F}, P)$ and an RV X defined on $\mathcal{P}$, for each outcome $\zeta \in \Omega$ we generate
the real number $E[Y|X = X(\zeta)]$. Thus, for $\zeta$ variable, $E[Y|X]$ is an RV that assumes the
value $E[Y|X = X(\zeta)]$ when $\zeta$ is the outcome of the underlying experiment. As always, the
functional dependence of X on $\zeta$ is suppressed, and we specify X rather than the underlying
probability space $\mathcal{P}$. The following example illustrates the use of the conditional expectation
as an RV.


Example 4.2-6
(multi-channel communications) Consider a communication system in which the message
delay (in milliseconds) is T and the channel choice is L. Let L = 1 for a satellite channel,
L = 2 for a coaxial cable channel, L = 3 for a microwave surface link, and L = 4 for a fiber-optical
link. A channel is chosen based on availability, which is a random phenomenon.
Suppose $P_L(l) = 1/4$, $l = 1, \ldots, 4$. Assume that it is known that E[T|L = 1] = 500,
E[T|L = 2] = 300, E[T|L = 3] = 200, and E[T|L = 4] = 100. Then the RV $g(L) \triangleq E[T|L]$
is defined by

\[
g(L) = \begin{cases}
500, & \text{for } L = 1, \quad P_L(1) = \tfrac{1}{4}, \\
300, & \text{for } L = 2, \quad P_L(2) = \tfrac{1}{4}, \\
200, & \text{for } L = 3, \quad P_L(3) = \tfrac{1}{4}, \\
100, & \text{for } L = 4, \quad P_L(4) = \tfrac{1}{4},
\end{cases}
\]

and $E[T] = E[g(L)] = 500 \times \frac{1}{4} + 300 \times \frac{1}{4} + 200 \times \frac{1}{4} + 100 \times \frac{1}{4} = 275$.
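To make the point that E[T|L] is itself a random variable with its own PMF, the following Python sketch represents g(L) as a function of the channel index and tabulates the distribution of its values before averaging. The dictionary layout and function names are illustrative choices only.

```python
from fractions import Fraction

P_L = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 4), 4: Fraction(1, 4)}
COND_MEAN_T = {1: 500, 2: 300, 3: 200, 4: 100}   # E[T | L = l] in milliseconds


def g(l):
    """Value taken by the RV E[T | L] when the outcome gives L = l."""
    return COND_MEAN_T[l]


def pmf_of_g():
    """PMF of the RV g(L): maps each value of g to its probability."""
    pmf = {}
    for l, p in P_L.items():
        pmf[g(l)] = pmf.get(g(l), 0) + p
    return pmf


def mean_T():
    """E[T] = E[g(L)], computed from the PMF of g(L)."""
    return sum(value * p for value, p in pmf_of_g().items())


if __name__ == "__main__":
    print(pmf_of_g())
    print(mean_T())
```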





The notion of E[Y|X] being an RV is equally valid for discrete, continuous, or mixed RVs X.
For example, Equation 4.2-12

\[
E[Y] = \int_{-\infty}^{\infty} E[Y|X = x]\, f_X(x)\, dx
\]

can also be written as E[Y] = E[E[Y|X]], where E[Y|X] in this case is a function of the
continuous RV X. The inner expectation is with respect to Y and the outer with respect
to X.

The foregoing can be extended to more complex situations. For example, the object
E[Z|X,Y] is a function of the RVs X and Y and therefore is a function of two RVs. For
a particular outcome $\zeta \in \Omega$, it assumes the value $E[Z|X(\zeta), Y(\zeta)]$. To compute E[Z] we
would write E[Z] = E[E[Z|X,Y]], which, for example, in the case of continuous RVs yields

\begin{align}
E[Z] &= E[E[Z|X,Y]] \notag \\
     &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} z f_{Z|X,Y}(z|x,y)\, f_{XY}(x,y)\, dx\, dy\, dz. \tag{4.2-28}
\end{align}


We conclude this section by summarizing some properties of conditional expectations.

Properties of Conditional Expectation:

Property (i). E[Y] = E[E[Y|X]].

Proof See arguments leading up to Equation 4.2-8 for the discrete case and Equation 4.2-12
for the continuous case. The inner expectation is with respect to Y, the outer
with respect to X.

Property (ii). If X and Y are independent, then E[Y|X] = E[Y].

Proof

\[
E[Y|X = x] = \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\, dy.
\]

But $f_{XY}(x,y) = f_{Y|X}(y|x) f_X(x) = f_Y(y) f_X(x)$ if X and Y are independent. Hence
$f_{Y|X}(y|x) = f_Y(y)$ and

\[
E[Y|X = x] = \int_{-\infty}^{\infty} y f_Y(y)\, dy = E[Y]
\]

for each x. Thus,

\[
E[Y|X] = \int_{-\infty}^{\infty} y f_Y(y)\, dy = E[Y].
\]

An analogous proof holds for the discrete case.

Property (iii). E[Z|X] = E[E[Z|X,Y]|X].

Proof

\begin{align}
E[Z|X = x] &= \int_{-\infty}^{\infty} z f_{Z|X}(z|x)\, dz \notag \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} z f_{Z|X,Y}(z|x,y)\, f_{Y|X}(y|x)\, dz\, dy \notag \\
&= \int_{-\infty}^{\infty} dy\, f_{Y|X}(y|x) \int_{-\infty}^{\infty} z f_{Z|X,Y}(z|x,y)\, dz \notag \\
&= E[E[Z|X,Y]|X = x], \notag
\end{align}

where the inner expectation is with respect to Z and the outer with respect to Y. Since
this is true for all x, we have E[Z|X] = E[E[Z|X,Y]|X]. The mean $\mu_Y = E[Y]$ is an
estimate of the RV Y. The mean-square error in this estimate is $\varepsilon^2 = E[(Y - \mu_Y)^2]$.
In fact this estimate is optimal in that any constant other than $\mu_Y$ would lead to an
increased $\varepsilon^2$. ■

4.3 MOMENTS OF RANDOM VARIABLES


Although the expectation is an important “summary” number for the behavior of an RV, 
it is far from adequate in describing the complete behavior of the RV. Indeed, we saw in 
Section 4.1 that two sets of numbers could have the same sample mean but the sample 
deviations could be quite different. Likewise, for two RVs their expectations could be the 
same but their standard deviations could be very different. Summary numbers like $\mu_X$, $\sigma_X^2$,
$E[X^2]$, and others are called moments. Generally, an RV will have many nonzero higher-order
moments and, under certain conditions (Section 4.5), it is possible to completely
describe the behavior of the RV, that is, reconstruct its pdf from knowledge of all the 
moments. In the following definitions we shall assume that the moments exist. However, 
this is not always the case. 


Definition 4.3-1 The rth moment of X is defined as

\[
m_r \triangleq E[X^r] = \int_{-\infty}^{\infty} x^r f_X(x)\, dx, \quad \text{where } r = 0, 1, 2, 3, \ldots. \tag{4.3-1}
\]

If X is a discrete RV, the rth moment can be computed from the PMF as

\[
m_r \triangleq \sum_i x_i^r P_X(x_i).
\]

We note that $m_0 = 1$, $m_1 = \mu$ (the mean). ■

Definition 4.3-2 The rth central moment of X is defined as

\[
c_r \triangleq E[(X - \mu)^r], \quad \text{where } r = 0, 1, 2, 3, \ldots. \tag{4.3-2a}
\]

For a discrete RV we can compute $c_r$ from

\[
c_r \triangleq \sum_i (x_i - \mu)^r P_X(x_i). \qquad \blacksquare \tag{4.3-2b}
\]

The most frequently used central moment is $c_2$. It is called the variance and is denoted by
$\sigma^2$ and also sometimes by Var[X]. Note that $c_0 = 1$, $c_1 = 0$, $c_2 = \sigma^2$. An important formula
that connects the variance to $E[X^2]$ and $\mu$ is obtained as follows:

\[
\sigma^2 = E[(X - \mu)^2] = E[X^2] - E[2\mu X] + E[\mu^2].
\]

But for any constant a, E[aX] = aE[X] and $E[a^2] = a^2$. Thus

\begin{align}
\sigma^2 &= E[X^2] - 2\mu E[X] + \mu^2 \notag \\
         &= E[X^2] - \mu^2 \tag{4.3-3}
\end{align}

since $E[X] \triangleq \mu$. In order to save symbology, an overbar is often used to denote expectation.
Thus $\overline{X^r} \triangleq E[X^r]$, and so forth, for other moments. Using this notation, Equation 4.3-3
appears as

\[
\sigma^2 = \overline{X^2} - \mu^2 \tag{4.3-4a}
\]

or, equivalently,

\[
\overline{X^2} = \sigma^2 + \mu^2. \tag{4.3-4b}
\]

Equation 4.3-4a relates the second central moment $c_2$ to $m_2$ and $\mu$. We can generalize this
result as follows. Observe that

\[
(X - \mu)^r = \sum_{i=0}^{r} \binom{r}{i} (-1)^i \mu^i X^{r-i}. \tag{4.3-5a}
\]

By taking the expectation of both sides of Equation 4.3-5a and recalling the linearity of the
expectation operator, we obtain

\[
c_r = \sum_{i=0}^{r} \binom{r}{i} (-1)^i \mu^i m_{r-i}. \tag{4.3-5b}
\]
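Equation 4.3-5b translates directly into code. The Python sketch below converts a list of raw moments into a central moment; as test data it uses the exponential law with mean $\mu$, whose raw moments $m_k = k!\,\mu^k$ are a standard fact assumed here rather than derived in the text. The function name is hypothetical.

```python
from math import comb, factorial


def central_from_raw(raw, r):
    """Equation 4.3-5b: c_r = sum_i C(r, i) (-1)^i mu^i m_{r-i}, with raw[k] = m_k."""
    mu = raw[1]
    return sum(comb(r, i) * (-1) ** i * mu ** i * raw[r - i] for i in range(r + 1))


if __name__ == "__main__":
    mu = 2
    # Raw moments m_0, ..., m_3 of an exponential RV with mean mu (assumed: k! mu^k).
    raw = [factorial(k) * mu ** k for k in range(4)]
    for r in range(4):
        print(r, central_from_raw(raw, r))
```

For this test case the second central moment comes out as $\mu^2$, in agreement with the variance listed for the exponential law in Table 4.3-1.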


Example 4.3-1
Let us compute $m_2$ for X, a binomial RV. By definition

\[
P_X(k) = \binom{n}{k} p^k q^{n-k}
\]

and

\begin{align}
m_2 &= \sum_{k=0}^{n} k^2 \binom{n}{k} p^k q^{n-k} \notag \\
    &= p^2 n(n-1) + np \notag \\
    &= n^2 p^2 + npq. \tag{4.3-6}
\end{align}

In going from line 2 to line 3 several steps of algebra were used whose duplication we leave
as an exercise. In going from line 3 to line 4, we rearranged terms and used the fact that
$q \triangleq 1 - p$. The expected value of X is

\[
m_1 = \sum_{k=0}^{n} k \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} = np. \tag{4.3-7}
\]

Using this result in Equation 4.3-6 and recalling Equation 4.3-4 allow us to conclude that
for a binomial RV with PMF b(k; n, p)

\[
\sigma^2 = npq. \tag{4.3-8}
\]

For any given n, maximum variance is obtained when p = q = 0.5 (Figure 4.3-1).
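The algebra left as an exercise can at least be checked numerically. The Python sketch below sums the binomial PMF directly to obtain $m_1$ and $m_2$, compares them with Equations 4.3-6 and 4.3-8, and scans a grid of p values to locate the maximum variance. The values n = 12 and p = 0.3 and the grid spacing are arbitrary.

```python
from math import comb


def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def raw_moment(r, n, p):
    """m_r = sum_k k^r b(k; n, p), summed directly."""
    return sum(k ** r * binomial_pmf(k, n, p) for k in range(n + 1))


def variance(n, p):
    """sigma^2 = m_2 - m_1^2 (Equation 4.3-4a)."""
    return raw_moment(2, n, p) - raw_moment(1, n, p) ** 2


if __name__ == "__main__":
    n, p = 12, 0.3
    q = 1.0 - p
    print(raw_moment(2, n, p), n * n * p * p + n * p * q)
    print(variance(n, p), n * p * q)
    grid = [i / 100 for i in range(1, 100)]
    print(max(grid, key=lambda pp: variance(n, pp)))
```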


Figure 4.3-1 Variance of a binomial RV versus p.

Example 4.3-2
(second moment of zero-mean Gaussian) Let us compute central moment $c_2$ for $X: N(0, \sigma^2)$.
Since $\mu = 0$, $c_2 = m_2$ and

\[
c_2 = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x^2 e^{-\frac{1}{2}(x/\sigma)^2}\, dx.
\]

But this integral was already evaluated in Example 4.1-7, where we found $E[X^2] = \sigma^2$.
Thus, the variance of a Gaussian RV is indeed the parameter $\sigma^2$ regardless of whether X is
zero-mean or not.








An interesting and somewhat more difficult example that illustrates a useful application 
of moments is given next. 


Example 4.3-3
(entropy) The maximum entropy (ME) principle states that if we don’t know the pdf $f_X(x)$
of X but would like to estimate it with a function, say p(x), a good choice is the function
p(x) which maximizes the entropy, defined by [4-5],

\[
H[X] \triangleq -\int_{-\infty}^{\infty} p(x) \ln p(x)\, dx \tag{4.3-9}
\]

and which satisfies the constraints

\begin{align}
p(x) &\ge 0 \tag{4.3-10a} \\
\int_{-\infty}^{\infty} p(x)\, dx &= 1 \tag{4.3-10b} \\
\int_{-\infty}^{\infty} x p(x)\, dx &= \mu \tag{4.3-10c} \\
\int_{-\infty}^{\infty} x^2 p(x)\, dx &= m_2, \text{ and so forth.} \tag{4.3-10d}
\end{align}

Suppose we know from measurements or otherwise only $\mu$ in Equation 4.3-10c and that
$x \ge 0$. Thus, we wish to find p(x) that maximizes H[X] of Equation 4.3-9 subject to
the first three constraints of Equation 4.3-10. Using the method of Lagrange multipliers
[4-6], the solution is obtained by maximizing the expression

\[
-\int_0^{\infty} p(x) \ln p(x)\, dx - \lambda_1 \int_0^{\infty} p(x)\, dx - \lambda_2 \int_0^{\infty} x p(x)\, dx
\]

by differentiation with respect to p(x). The constants $\lambda_1$ and $\lambda_2$ are Lagrange multipliers
and must be determined. After differentiating we obtain

\[
\ln p(x) = -(1 + \lambda_1) - \lambda_2 x
\]

or

\[
p(x) = e^{-(1 + \lambda_1 + \lambda_2 x)}. \tag{4.3-11}
\]

When this result is substituted in Equations 4.3-10b and 4.3-10c, we find that

\[
e^{-(1+\lambda_1)} = \frac{1}{\mu}, \qquad \mu > 0,
\]

and

\[
\lambda_2 = \frac{1}{\mu}.
\]

Hence our ME estimate of $f_X(x)$ is

\[
p(x) = \begin{cases} \dfrac{1}{\mu} e^{-x/\mu}, & x \ge 0, \\ 0, & x < 0. \end{cases} \tag{4.3-12}
\]

The problem of obtaining the ME estimate of $f_X(x)$ when both $\mu$ and $\sigma^2$ are known is left
as an exercise. In this case p(x) is the Normal distribution with mean $\mu$ and variance $\sigma^2$.





Tables of common means, variances, and mean-square values. Table 4.3-1 is a table
of means, variances, and mean-square values for common continuous RVs. Some of these
have been calculated already in the text. Others are left as end-of-chapter problems.
Table 4.3-2 is a similar table for common discrete RVs.

Less useful than $m_r$ or $c_r$ are the absolute moments and generalized moments about some
arbitrary point, say a, defined by, respectively,

\[
E[|X|^r] \triangleq \int_{-\infty}^{\infty} |x|^r f_X(x)\, dx \quad \text{(absolute moment)}
\]

\[
E[(X - a)^r] \triangleq \int_{-\infty}^{\infty} (x - a)^r f_X(x)\, dx \quad \text{(generalized moment).}
\]

Note that if we set $a = \mu$, the generalized moments about a are then the central moments.
If a = 0, the generalized moments are simply the moments $m_r$.


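Table 4.3-1 Means, Variances, and Mean-Square Values for Common Continuous RVs

Family        pdf f(x)                                                    Mean μ = E[X]           Variance σ²                      Mean square E[X²]
Uniform       $U(a, b)$                                                   $\frac{1}{2}(a+b)$      $\frac{1}{12}(b-a)^2$            $\frac{1}{3}(b^2+ab+a^2)$
Exponential   $\frac{1}{\mu} e^{-x/\mu} u(x)$                             $\mu$                   $\mu^2$                          $2\mu^2$
Gaussian      $\frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)}$  $\mu$                   $\sigma^2$                       $\mu^2+\sigma^2$
Laplacian     $\frac{1}{\sqrt{2}\,\sigma} e^{-\sqrt{2}|x|/\sigma}$        $0$                     $\sigma^2$                       $\sigma^2$
Rayleigh      $\frac{x}{\sigma^2} e^{-x^2/(2\sigma^2)} u(x)$              $\sigma\sqrt{\pi/2}$    $(2-\frac{\pi}{2})\sigma^2$      $2\sigma^2$

Table 4.3-2 Means, Variances, and Mean-Square Values for Common Discrete RVs

Family        PMF P(k)                                                              Mean μ = E[K]   Variance σ²     Mean square E[K²]
Bernoulli     $P_B(k) = q$ for $k=0$; $p$ for $k=1$; $0$ else                       $p$             $pq$            $p$
Binomial      $b(k; n, p) = \binom{n}{k} p^k q^{n-k}$                               $np$            $npq$           $(np)^2 + npq$
Geometric†    $\frac{1}{1+\mu}\left(\frac{\mu}{1+\mu}\right)^k u(k)$                $\mu$           $\mu+\mu^2$     $\mu+2\mu^2$
Poisson       $\frac{\alpha^k}{k!} e^{-\alpha} u(k)$                                $\alpha$        $\alpha$        $\alpha^2+\alpha$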




Joint Moments 


Let us now turn to a topic first touched upon in Example 4.2-3. Suppose we are given 
two RVs X and Y and wish to have a measure of how good a linear prediction we can 
make of the value of, say, Y upon observing what value X has. At one extreme if X and 
Y are independent, observing X tells us nothing about Y. At the other extreme if, say, 
Y = aX +b, then observing the value of X immediately tells us the value of Y. However, in 
many situations in the real world, two RVs are neither completely independent nor linearly 
dependent. Given this state of affairs, it then becomes important to have a measure of 
how much can be said about one RV from observing another. The quantities called joint 
moments offer us such a measure. Not all joint moments, to be sure, are equally important 
in this task; especially important are certain second-order joint moments (to be defined 
shortly). However, as we shall see later, in various applications other joint moments are 
important as well and so we shall deal with the general case below. 


†The geometric PMF is sometimes written in terms of the parameter a as $(1-a)a^k u(k)$ with $0 < a < 1$.
Then $\mu = a/(1-a)$ with $\mu > 0$.

Definition 4.3-3 The ijth joint moment of X and Y is given by

\begin{align}
m_{ij} &\triangleq E[X^i Y^j] \notag \\
       &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^i y^j f_{XY}(x, y)\, dx\, dy. \tag{4.3-13}
\end{align}

If X and Y are discrete, we can compute $m_{ij}$ from the PMF as

\[
m_{ij} \triangleq \sum_l \sum_m x_l^i y_m^j P_{X,Y}(x_l, y_m). \qquad \blacksquare \tag{4.3-14}
\]

Definition 4.3-4 The ijth joint central moment of X and Y is defined by

\[
c_{ij} \triangleq E[(X - \overline{X})^i (Y - \overline{Y})^j], \tag{4.3-15}
\]

where, in the notation introduced earlier, $\overline{X} \triangleq E[X]$, and so forth, for Y. The order of the
moment is i + j. Thus, all of the following are second-order moments:

\begin{align*}
m_{02} &= E[Y^2] & c_{02} &= E[(Y - \overline{Y})^2] \\
m_{20} &= E[X^2] & c_{20} &= E[(X - \overline{X})^2] \\
m_{11} &= E[XY]  & c_{11} &= E[(X - \overline{X})(Y - \overline{Y})] \\
       &         &        &= E[XY] - \overline{X}\,\overline{Y} \\
       &         &        &\triangleq \mathrm{Cov}[X, Y]. \qquad \blacksquare
\end{align*}

As measures of predictability and in some cases statistical dependence, the most important
joint moments are $m_{11}$ and $c_{11}$; they are known as the correlation and covariance of X and
Y, respectively. The correlation coefficient† defined by

\[
\rho \triangleq \frac{c_{11}}{\sqrt{c_{02}\, c_{20}}} \tag{4.3-16}
\]

was already introduced in Section 4.2 (Equation 4.2-16). It satisfies $|\rho| \le 1$. To show this
consider the nonnegative expression

\[
E[(\lambda(X - \mu_X) - (Y - \mu_Y))^2] \ge 0,
\]

where $\lambda$ is any real constant. To verify that the left side is indeed nonnegative, we merely
rewrite it in the form

\[
Q(\lambda) \triangleq \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [\lambda(x - \mu_X) - (y - \mu_Y)]^2 f_{XY}(x, y)\, dx\, dy \ge 0,
\]

where the $\ge$ follows from the fact that the integral of a nonnegative quantity cannot be
negative.

†Note that it would be more properly termed the covariance coefficient or normalized covariance.

The previous equation is a quadratic in $\lambda$. Indeed, after expanding we obtain

\[
Q(\lambda) = \lambda^2 c_{20} + c_{02} - 2\lambda c_{11} \ge 0.
\]

Thus $Q(\lambda)$ can have at most one real root. Hence its discriminant must satisfy

\[
\left(\frac{c_{11}}{c_{20}}\right)^2 - \frac{c_{02}}{c_{20}} \le 0
\]

or

\[
c_{11}^2 \le c_{02}\, c_{20}, \tag{4.3-17}
\]

whence the condition $|\rho| \le 1$ follows.
When $c_{11}^2 = c_{02}\, c_{20}$, that is, $|\rho| = 1$, it is readily established that

\[
E\left[\left(\frac{c_{11}}{c_{20}}(X - \mu_X) - (Y - \mu_Y)\right)^2\right] = 0
\]

or, equivalently, that

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \left(\frac{c_{11}}{c_{20}}(x - \mu_X) - (y - \mu_Y)\right)^2 f_{XY}(x, y)\, dx\, dy = 0. \tag{4.3-18}
\]

Since $f_{XY}(x, y)$ is never negative, Equation 4.3-18 implies that the term in parentheses is
zero everywhere.† Thus, we have from Equation 4.3-18 that when $|\rho| = 1$

\[
Y = \frac{c_{11}}{c_{20}}(X - \mu_X) + \mu_Y, \tag{4.3-19}
\]

that is, Y is a linear function of X. When Cov[X, Y] = 0, $\rho = 0$ and X and Y are said to
be uncorrelated.


Properties of Uncorrelated Random Variables
(a) If X and Y are uncorrelated, then

\[
\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2, \tag{4.3-20}
\]

where

\[
\sigma_{X+Y}^2 \triangleq E[(X + Y)^2] - (E[X + Y])^2.
\]

†Except possibly over a bizarre set of points of zero probability. To be more precise, we should exchange
the word “everywhere” in the text to “almost everywhere,” often abbreviated a.e.

(b) If X and Y are independent, they are uncorrelated.

Proof of (a): We leave this as an exercise to the reader.

Proof of (b): Since Cov[X, Y] = E[XY] − E[X]E[Y], we must
show that E[XY] = E[X]E[Y]. But

\begin{align*}
E[XY] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x y f_{XY}(x, y)\, dx\, dy \\
      &= \int_{-\infty}^{\infty} x f_X(x)\, dx \int_{-\infty}^{\infty} y f_Y(y)\, dy \quad \text{(by independence assumption)} \\
      &= E[X]E[Y]. \qquad \blacksquare
\end{align*}


Example 4.3-4
(linear prediction) Suppose we wish to predict the values of an RV Y by observing the
values of another RV X. In particular, the available data (Figure 4.3-2) suggest that a good
prediction model for Y is the linear function

\[
Y_p \triangleq \alpha X + \beta. \tag{4.3-21}
\]

Now although Y may be related to X, the values it takes on may be influenced by other
sources that do not affect X. Thus, in general, $|\rho| \ne 1$ and we expect that there will be
an error between the predicted value of Y, that is, $Y_p$, and the value that Y actually
assumes. Our task becomes then to adjust the coefficients $\alpha$ and $\beta$ in order to minimize the
mean-square error

\[
\varepsilon^2 \triangleq E[(Y - Y_p)^2]. \tag{4.3-22}
\]

This problem is a simple version of optimum linear prediction. In statistics it is called linear
regression.

Solution Upon expanding Equation 4.3-22, we obtain

\[
\varepsilon^2 = E[Y^2] - 2\alpha E[XY] - 2\beta\mu_Y + 2\alpha\beta\mu_X + \alpha^2 E[X^2] + \beta^2.
\]

Figure 4.3-2 Pairwise observations on (X, Y) constitute a scatter diagram. The relationship between
X and Y is approximated with a straight line.

To minimize $\varepsilon^2$ with respect to $\alpha$ and $\beta$, we solve for $\alpha$ and $\beta$ that satisfy

\[
\frac{\partial \varepsilon^2}{\partial \alpha} = 0, \qquad \frac{\partial \varepsilon^2}{\partial \beta} = 0. \tag{4.3-23}
\]

This yields the best $\alpha$ and $\beta$, which we denote by $\alpha_0$, $\beta_0$ in the sense that they minimize $\varepsilon^2$.
A little algebra establishes that

\[
\alpha_0 = \frac{\mathrm{Cov}[X, Y]}{\sigma_X^2} = \frac{\rho\,\sigma_Y}{\sigma_X} \tag{4.3-24a}
\]

and

\begin{align}
\beta_0 &= \overline{Y} - \frac{\mathrm{Cov}[X, Y]}{\sigma_X^2}\,\overline{X} \notag \\
        &= \overline{Y} - \rho\frac{\sigma_Y}{\sigma_X}\,\overline{X}. \tag{4.3-24b}
\end{align}

Thus, the best linear predictor is given by

\[
Y_p - \mu_Y = \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X) \tag{4.3-25}
\]

and passes through the point $(\mu_X, \mu_Y)$. If we use $\alpha_0$, $\beta_0$ in Equation 4.3-22 we obtain the
smallest mean-square error $\varepsilon_{\min}^2$ (the derivation is Problem 4.33),

\[
\varepsilon_{\min}^2 = \sigma_Y^2 (1 - \rho^2). \tag{4.3-26}
\]
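A small numerical illustration of Equations 4.3-24 through 4.3-26 is sketched below in Python. It replaces the expectations by sample averages over a hypothetical data set (the five data pairs are invented for illustration), computes $\alpha_0$, $\beta_0$, and $\rho$, and then checks that the achieved mean-square error equals $\sigma_Y^2(1-\rho^2)$ and that perturbing either coefficient makes the error larger.

```python
def linear_predictor(xs, ys):
    """Return (alpha0, beta0, rho, mse_min), with sample averages standing in for expectations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    alpha0 = cov / var_x                      # Equation 4.3-24a
    beta0 = my - alpha0 * mx                  # Equation 4.3-24b
    rho = cov / (var_x * var_y) ** 0.5        # Equation 4.3-16
    mse_min = var_y * (1.0 - rho ** 2)        # Equation 4.3-26
    return alpha0, beta0, rho, mse_min


def mse(xs, ys, alpha, beta):
    """Sample version of Equation 4.3-22 for the predictor alpha * x + beta."""
    return sum((y - alpha * x - beta) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]            # hypothetical observations of X
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]           # hypothetical observations of Y
    alpha0, beta0, rho, mse_min = linear_predictor(xs, ys)
    print(alpha0, beta0, rho)
    print(mse(xs, ys, alpha0, beta0), mse_min)
```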


Something rather strange happens when $\rho = 0$. From Equation 4.3-25 we see that for
$\rho = 0$, $Y_p = \mu_Y$ regardless of X! This means that observing X has no bearing on our
prediction of Y, and the best predictor is merely $Y_p = \mu_Y$. We encountered somewhat the
same situation in Example 4.2-3. Thus, associating the correlation coefficient with ability to
predict seems justified in problems involving linear prediction and the joint Gaussian pdf.
In some fields, a lack of correlation between two RVs is taken to be prima facie evidence 
that they are unrelated, that is, independent. No doubt this conclusion arises in part from 
the fact that if two RVs, say, X and Y, are indeed independent, they will be uncorrelated. 
As stated earlier, the opposite is generally not true. An example follows. 


Example 4.3-5 


(uncorrelated is weaker than independence) Consider two RVs X and Y with joint PMF
$P_{XY}(x_i, y_j)$ as shown.

Values of $P_{XY}(x_i, y_j)$

             x_1 = -1    x_2 = 0    x_3 = +1
y_1 = 0         0          1/3          0
y_2 = 1        1/3          0          1/3

X and Y are not independent, since $P_{XY}(0,1) = 0 \neq P_X(0)P_Y(1) = 2/9$. Furthermore,
$\mu_X = 0$ so that $\mathrm{Cov}(X,Y) = E[XY] - \mu_X\mu_Y = E[XY]$. We readily compute

$$m_{11} = (-1)(1)\tfrac{1}{3} + (1)(1)\tfrac{1}{3} = 0.$$

Hence X and Y are uncorrelated but not independent.
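The claim can be checked mechanically with exact rational arithmetic. The sketch below assumes the joint PMF values implied by the example's stated results (probability 1/3 at each of (0, 0), (-1, 1), and (+1, 1), and zero elsewhere); the dictionary layout and helper name are illustrative.

```python
from fractions import Fraction

# Joint PMF keyed by (x, y); values assumed from the example's stated results.
third = Fraction(1, 3)
pmf = {
    (-1, 0): Fraction(0), (0, 0): third, (1, 0): Fraction(0),
    (-1, 1): third, (0, 1): Fraction(0), (1, 1): third,
}

def marginals(joint):
    """Return the marginal PMFs of X and Y as dictionaries."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, Fraction(0)) + p
        py[y] = py.get(y, Fraction(0)) + p
    return px, py

px, py = marginals(pmf)
mean_x = sum(x * p for x, p in px.items())
mean_y = sum(y * p for y, p in py.items())
e_xy = sum(x * y * p for (x, y), p in pmf.items())
cov = e_xy - mean_x * mean_y

# Independence would require the joint PMF to factor at every point.
independent = all(pmf[(x, y)] == px[x] * py[y] for x in px for y in py)

print("Cov(X, Y) =", cov)
print("independent:", independent)
print("P_X(0) P_Y(1) =", px[0] * py[1], "but P_XY(0, 1) =", pmf[(0, 1)])
```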


There is an important special case for which $\rho = 0$ always implies independence. We now
discuss this case.


Jointly Gaussian Random Variables


We say that two RVs are jointly Gaussian† (or jointly Normal) if their joint pdf is

$$f_{XY}(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left\{ \frac{-1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\,\frac{(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 \right] \right\}. \tag{4.3-27}$$

Five parameters are involved: $\sigma_X$, $\sigma_Y$, $\mu_X$, $\mu_Y$, and $\rho$. If $\rho = 0$ we observe that

$$f_{XY}(x,y) = f_X(x) f_Y(y),$$

where

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_X} \exp\left( -\frac{1}{2} \left( \frac{x-\mu_X}{\sigma_X} \right)^2 \right) \tag{4.3-28}$$

and

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y} \exp\left( -\frac{1}{2} \left( \frac{y-\mu_Y}{\sigma_Y} \right)^2 \right). \tag{4.3-29}$$

Thus, two jointly Gaussian RVs that are uncorrelated (i.e., $\rho = 0$) are also independent. The
marginal densities $f_X(x)$ and $f_Y(y)$ for jointly normal RVs are always normal regardless of
what $\rho$ is. However, the converse does not hold; that is, if $f_X(x)$ and $f_Y(y)$ are Gaussian,
one cannot conclude that X and Y are jointly Gaussian.

To see this we borrow from a popular x-ray imaging technique called computerized
tomography (CT), useful for detecting cancer and other abnormalities in the body. Suppose
we have an object with x-ray absorptivity function $f(x,y) \geq 0$. This function is like a joint
pdf in that it is real, never negative, and easily normalized to a unit volume; however, this
last feature is not important. Thus, we can establish a one-to-one relationship between a
joint pdf $f_{XY}(x,y)$ and the x-ray absorptivity $f(x,y)$. In CT, x-rays are passed through the
object along different lines, for some fixed angle, and the integrals of the absorptivity are
measured and recorded. Each integral is called a projection and the set of all projections
for a given angle $\theta$ is called the profile function at $\theta$. Thus, the projection for a line at angle
$\theta$ and displacement s from the center is given by [Figure 4.3-3(a)]

$$f_\theta(s) = \int_{L(s,\theta)} f(x,y)\,dl,$$

where $L(s,\theta)$ are the points along a line displaced from the center by s at angle $\theta$ and dl
is a differential length along $L(s,\theta)$. If we let s vary from its smallest to largest value, we
obtain the profile function for that angle. By collecting all the profiles for all the angles and
using a sophisticated algorithm called filtered-convolution back-projection, it is possible to
get a high-quality x-ray image of the body. Suppose we measure the profiles at 0 degrees
and 90 degrees as shown in Figure 4.3-3(b). Then we obtain

$$f_1(x) = \int_{-\infty}^{\infty} f(x,y)\,dy \quad \text{(horizontal profile)}$$

$$f_2(y) = \int_{-\infty}^{\infty} f(x,y)\,dx \quad \text{(vertical profile).}$$

If $f(x,y)$ is Gaussian, then we already know that $f_1(x)$ and $f_2(y)$ will be Gaussian because
$f_1$ and $f_2$ are analogous to marginal pdfs. Now is it possible to modify $f(x,y)$ from Gaussian
to non-Gaussian without observing a change in the Gaussian profiles? If yes, we have
demonstrated our assertion that Gaussian marginals do not necessarily imply a joint Gaussian
pdf. In Figure 4.3-3(c) we increase the absorptivity of the object by an amount P along the
45-degree strip running from a to b and decrease the absorptivity by the same amount P
along the 135-degree strip running from a' to b'. Then since the profile integrals add and
subtract P in both horizontal and vertical directions, the net change in $f_1(x)$ and $f_2(y)$ is
zero. This proves our assertion. We assume that P is not so large that when subtracted
from $f(x,y)$ along a'-b', the result is negative. The reason we must make this assumption
is that pdfs and x-ray absorptivities can never be negative.

Figure 4.3-3 Using the computerized tomography paradigm to show that Gaussian marginal pdfs do
not imply a joint Gaussian distribution. (a) A projection is the line integral at displacement s and angle
$\theta$. The set of all projections for a given angle is the profile function for that angle. (b) A joint Gaussian
x-ray object produces Gaussian-shaped profile functions in the horizontal and vertical directions; (c) by
adding a constant absorptivity along a-b and subtracting an absorptivity along a'-b', the profile functions
remain the same but the underlying absorptivity is not Gaussian anymore.

†The jointly Normal pdf is sometimes called the two-dimensional Normal pdf in anticipation of the
general multi-dimensional Normal pdf. The latter becomes very cumbersome to write without using matrix
notation (Chapter 5).
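The strip argument can be imitated on a discrete grid. The following is only a sketch under stated assumptions: the density is sampled on a 41-by-41 grid, the two strips are modeled as short runs of cells along the main diagonal and the anti-diagonal through the center, and the names `gaussian_grid`, `profiles`, and `perturb` are hypothetical. Row and column sums play the role of the two profile functions.

```python
import math

def gaussian_grid(n=41, half_width=4.0):
    """Sample a zero-mean, unit-variance, uncorrelated joint Gaussian on an n-by-n grid."""
    step = 2.0 * half_width / (n - 1)
    pts = [-half_width + i * step for i in range(n)]
    return [[math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi) for y in pts] for x in pts]

def profiles(grid):
    """Return (row sums, column sums), the discrete analogs of the two profiles."""
    return [sum(row) for row in grid], [sum(col) for col in zip(*grid)]

def perturb(grid, amount, half_len):
    """Add `amount` along a diagonal strip and subtract it along the anti-diagonal strip."""
    out = [row[:] for row in grid]
    c = len(grid) // 2
    for k in range(-half_len, half_len + 1):
        if k == 0:
            continue  # the strips cross at the center, where the changes cancel
        out[c + k][c + k] += amount
        out[c + k][c - k] -= amount
    return out

base = gaussian_grid()
bumped = perturb(base, amount=0.02, half_len=5)

rows0, cols0 = profiles(base)
rows1, cols1 = profiles(bumped)
max_profile_change = max(abs(a - b) for a, b in zip(rows0 + cols0, rows1 + cols1))
smallest_value = min(min(row) for row in bumped)

print("largest change in any profile value:", max_profile_change)
print("smallest perturbed density value:", smallest_value)
```

Each affected row and column gains and loses the same amount, so the profiles are unchanged up to rounding while the grid itself is no longer a sampled Gaussian. The amount 0.02 was chosen small enough that no cell becomes negative.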

To illustrate a joint normal distribution consider the following somewhat idealized situ- 
ation. Let X and Y denote the height of the husband and wife, respectively, of a married pair 
picked at random from the population of married people. It is often assumed that X and 
Y are individually Gaussian although this is obviously only an approximation since heights 
are bounded from below by zero and from above by physiological constraints. Conventional 
wisdom has it that in our society tall people prefer tall mates and short people prefer short 
mates. If this is indeed true, then X and Y are positively correlated, that is, $\rho > 0$. On the
other hand, in certain societies it may be fashionable for tall men to marry short women
and for tall women to marry short men. Again we can expect X and Y to be correlated
albeit negatively this time, that is, $\rho < 0$. Finally, if all marriages are the result of a lottery,
we would expect $\rho$ to be zero or very small.†


*Contours of constant density of the joint Gaussian pdf. It is of interest to determine
the locus of points in the xy plane when $f_{XY}(x,y)$ is set constant. Clearly $f_{XY}(x,y)$ will
be constant if the quadratic form in the exponent is set to a constant, say, $c^2$:

$$\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\,\frac{(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 = c^2.$$

This is the equation of an ellipse centered at $x = \mu_X$, $y = \mu_Y$. For simplicity we set
$\mu_X = \mu_Y = 0$. When $\rho = 0$, the major and minor diameters of the ellipse are parallel to the
x- and y-axes, a condition we know to associate with independence of X and Y. If $\rho = 0$
and $\sigma_X = \sigma_Y$, the ellipse degenerates into a circle. Several cases are shown in Figure 4.3-4.

Surprisingly the marginal densities $f_X(x)$ and $f_Y(y)$ computed from the joint pdf of
Equation 4.3-27 do not depend on the parameter $\rho$. To see this we compute

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y)\,dy$$

with $\mu_X = \mu_Y = 0$ for simplicity. The integration, while somewhat messy, is easily done by
following these three steps:

1. Factor out of the integral all terms that do not depend on y;

2. Complete the squares in the exponent of e (see “completing the square” in
Appendix A); and

3. Recall that for b > 0 and real a

$$\frac{1}{\sqrt{2\pi b^2}} \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{2} \left( \frac{y-a}{b} \right)^2 \right] dy = 1.$$

†In statistics it is quite difficult to observe zero correlation between two random variables, even when in
theory they would be expected to be uncorrelated. The phenomenon of small, random correlations is used
by hucksters and others to prove a point, which in reality is not valid.

*Starred material can be omitted on a first reading.

Figure 4.3-4 Contours of constant density for the joint normal ($\mu_X = \mu_Y = 0$): (a) $\sigma_X = \sigma_Y$, $\rho = 0$;
(b) $\sigma_X > \sigma_Y$, $\rho = 0$; (c) $\sigma_X < \sigma_Y$, $\rho = 0$; (d) $\sigma_X = \sigma_Y$, $\rho > 0$.

Indeed after step 2 we obtain

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_X} \exp\left[ -\frac{x^2}{2\sigma_X^2} \right] \left\{ \frac{1}{\sqrt{2\pi\sigma_Y^2(1-\rho^2)}} \int_{-\infty}^{\infty} \exp\left[ -\frac{(y - \rho x \sigma_Y/\sigma_X)^2}{2\sigma_Y^2(1-\rho^2)} \right] dy \right\}. \tag{4.3-30}$$

But the term in curly brackets is unity. Hence

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_X} \exp\left[ -\frac{1}{2} \left( \frac{x}{\sigma_X} \right)^2 \right]. \tag{4.3-31}$$

A similar calculation for $f_Y(y)$ would furnish

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y} \exp\left[ -\frac{1}{2} \left( \frac{y}{\sigma_Y} \right)^2 \right]. \tag{4.3-32}$$


As we stated earlier, if $\rho = 0$, then X and Y are independent. On the other hand, as $\rho \to \pm 1$,
X and Y become linearly dependent. For simplicity let $\sigma_X = \sigma_Y \triangleq \sigma$ and $\mu_X = \mu_Y = 0$;
then the contour of constant density becomes

$$x^2 - 2\rho xy + y^2 = c^2\sigma^2,$$

which is a 45-degree tilted ellipse (with respect to the x-axis) for $\rho > 0$ and a 135-degree
tilted ellipse for $\rho < 0$. We can generate a coordinate system that is rotated 45 degrees
from the x-y system by introducing the coordinate transformation

$$v = \frac{x+y}{\sqrt{2}}, \qquad w = \frac{x-y}{\sqrt{2}}.$$

Then the contour of constant density becomes

$$v^2[1-\rho] + w^2[1+\rho] = \sigma^2 c^2,$$

which is an ellipse with major and minor diameters parallel to the v- and w-axes. If $\rho > 0$,
the major diameter is parallel to the v-axis; if $\rho < 0$, the major diameter is parallel to the
w-axis. As $\rho \to \pm 1$, the lengths of the major diameters become infinitely long and all of the
pdf concentrates along the line $y = x$ ($\rho \to 1$) or $y = -x$ ($\rho \to -1$).

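Finally by introducing two new RVs

$$V \triangleq (X+Y)/\sqrt{2}$$

$$W \triangleq (X-Y)/\sqrt{2},$$

we find that as $\rho \to 1$

$$f_{XY}(x,y) \to \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x}{\sigma} \right)^2 \right] \times \delta(y-x)$$

or, equivalently,

$$f_{XY}(x,y) \to \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{y}{\sigma} \right)^2 \right] \times \delta(y-x).$$

This degeneration of the joint Gaussian into a pdf of only one variable along the line $y = x$
is due to the fact that as $\rho \to 1$, X and Y become equal. We leave the details as an exercise
to the student.

A computer rendition of the joint Gaussian pdf and its contours of constant density is
shown in Figure 4.3-5 for $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 2$, and $\rho = 0.9$.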


4.4 CHEBYSHEV AND SCHWARZ INEQUALITIES 


The Chebyshev† inequality furnishes a bound on the probability of how much an RV X can
deviate from its mean value $\mu_X$.


Theorem 4.4-1 (Chebyshev inequality) Let X be an arbitrary RV with mean $\mu_X$
and finite variance $\sigma^2$. Then for any $\delta > 0$

$$P[|X - \mu_X| \geq \delta] \leq \frac{\sigma^2}{\delta^2}. \tag{4.4-1}$$

†Pafnuty L. Chebyshev (1821-1894), Russian mathematician.

Proof Equation 4.4-1 follows directly from the following observation:

$$\sigma^2 \triangleq \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx \geq \int_{|x-\mu_X| \geq \delta} (x - \mu_X)^2 f_X(x)\,dx \geq \delta^2 \int_{|x-\mu_X| \geq \delta} f_X(x)\,dx = \delta^2 P[|X - \mu_X| \geq \delta].$$

Since

$$\{|X - \mu_X| \geq \delta\} \cup \{|X - \mu_X| < \delta\} = \Omega \quad (\Omega \text{ being the certain event}),$$

and the two events in the union are disjoint, it follows that

$$P[|X - \mu_X| < \delta] \geq 1 - \frac{\sigma^2}{\delta^2}. \tag{4.4-2}$$

Sometimes it is convenient to express $\delta$ in terms of $\sigma$, that is, $\delta \triangleq k\sigma$, where k is a constant.‡
Then Equations 4.4-1 and 4.4-2 become, respectively,

$$P[|X - \mu_X| \geq k\sigma] \leq \frac{1}{k^2} \tag{4.4-3}$$

$$P[|X - \mu_X| < k\sigma] \geq 1 - \frac{1}{k^2}. \tag{4.4-4}$$


Example 4.4-1 
(deviation from the mean for a Normal RV) Let $X : N(\mu_X, \sigma^2)$. How do $P[|X - \mu_X| < k\sigma]$
and $P[|X - \mu_X| \geq k\sigma]$ compare with the Chebyshev bound (CB)?


Solution Using Equations 2.4-14d and 2.4-14e, it is easy to show that $P[|X - \mu_X| < k\sigma] =
2\,\mathrm{erf}(k)$ and $P[|X - \mu_X| \geq k\sigma] = 1 - 2\,\mathrm{erf}(k)$, where erf(k) is defined in Equation 2.4-12.
Using Table 2.4-1 and Equations 4.4-3 and 4.4-4, we obtain Table 4.4-1.

From Table 4.4-1 we see that the Chebyshev bound is not very good; however, it must
be recalled that the bound applies to any RV X as long as $\sigma^2$ exists.
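The comparison can be reproduced numerically. The sketch below uses Python's `math.erf`, which follows the standard convention $\mathrm{erf}(x) = (2/\sqrt{\pi})\int_0^x e^{-t^2}dt$; this need not coincide with the definition in Equation 2.4-12, so the Normal probability is written here as `erf(k / sqrt(2))`. The function names are illustrative.

```python
import math

def normal_prob_within(k):
    """P[|X - mu| < k*sigma] for a Normal RV, using the standard-library erf."""
    return math.erf(k / math.sqrt(2.0))

def chebyshev_lower_bound(k):
    """Chebyshev lower bound 1 - 1/k^2 on P[|X - mu| < k*sigma], clipped at zero."""
    return max(0.0, 1.0 - 1.0 / k ** 2)

rows = []
for k in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    inside = normal_prob_within(k)
    bound = chebyshev_lower_bound(k)
    # The complementary event has probability 1 - inside and upper bound 1 - bound.
    rows.append((k, inside, bound, 1.0 - inside, 1.0 - bound))
    print(f"k={k:3.1f}  P[<]={inside:.3f}  CB={bound:.3f}  "
          f"P[>=]={1.0 - inside:.3f}  CB={1.0 - bound:.3f}")
```

For every k the exact Normal probability lies well inside the Chebyshev bound, which is the price paid for a bound that holds for any distribution with finite variance.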

There are a number of extensions of the Chebyshev inequality.§ We consider such an
extension in what follows.





Markov Inequality 


Consider an RV X for which $f_X(x) = 0$ for $x < 0$. Then X is called a nonnegative RV and
the Markov inequality applies:

$$P[X \geq \delta] \leq \frac{E[X]}{\delta}. \tag{4.4-5}$$

In contrast to the Chebyshev bound, which involves both the mean and variance, this bound
involves only the mean of X.

‡The Chebyshev inequality is not very useful when k or $\delta$ is small.
§See Davenport [4-2, p. 256].

Table 4.4-1

k      P[|X - mu_X| < k sigma]   CB (lower bound)   P[|X - mu_X| >= k sigma]   CB (upper bound)
0            0                       0                    1                        1
0.5          0.383                   0                    0.617                    1
1.0          0.683                   0                    0.317                    1
1.5          0.866                   0.556                0.134                    0.444
2.0          0.955                   0.750                0.045                    0.250
2.5          0.988                   0.840                0.012                    0.160
3.0          0.997                   0.889                0.003                    0.111


Proof of Equation 4.4-5

$$E[X] = \int_0^{\infty} x f_X(x)\,dx \geq \int_{\delta}^{\infty} x f_X(x)\,dx \geq \delta \int_{\delta}^{\infty} f_X(x)\,dx = \delta P[X \geq \delta],$$

whence Equation 4.4-5 follows. Equation 4.4-5 puts a bound on what fraction of a population
can exceed $\delta$.


Example 4.4-2 
(bad resistors) Assume that in the manufacturing of very low grade electrical 1000-ohm 
resistors the average resistance, as determined by a statistical analysis of measurements, 
is indeed 1000 ohms but there is a large variation about this value. If all resistors over 
1500 ohms are to be discarded, what is the maximum fraction of resistors to meet such a 
fate? 





Solution With $\mu_X = 1000$ and $\delta = 1500$, we obtain

$$P[X \geq 1500] \leq \frac{1000}{1500} \approx 0.67.$$

Thus, if nothing else, the manufacturer has the assurance that the percentage of discarded
resistors cannot exceed 67 percent of the total.





The Schwarz Inequality 


We have already encountered the probabilistic form of the Schwarz† inequality in Equation
4.3-17, repeated here as

$$\mathrm{Cov}^2[X,Y] \leq E[(X - \mu_X)^2]\,E[(Y - \mu_Y)^2]$$

with equality if and only if Y is a linear function of X. Upon taking the square root of
both sides of this inequality, we have that the magnitude of the covariance between two RVs is
always upper bounded by the square root of the product of the two variances

$$|\mathrm{Cov}[X,Y]| \leq (\sigma_X^2 \sigma_Y^2)^{1/2}.$$

†H. Amandus Schwarz (1843-1921), German mathematician.

In later work we shall need another version of the Schwarz inequality that is commonly
used in obtaining results in signal processing and stochastic processes. Consider two
nonrandom (deterministic) functions h and g, not necessarily real valued. Define the norm
of an ordinary function f by

$$\|f\| \triangleq \left( \int_{-\infty}^{\infty} |f(x)|^2\,dx \right)^{1/2}, \tag{4.4-6}$$

whenever the integral exists. Also define the scalar or inner product of h with g, denoted
by (h, g), as

$$(h, g) \triangleq \int_{-\infty}^{\infty} h(x) g^*(x)\,dx = (g, h)^*. \tag{4.4-7}$$

The deterministic form of the Schwarz inequality is then

$$|(h, g)| \leq \|h\|\,\|g\| \tag{4.4-8}$$

with equality if and only if h is proportional to g, that is, $h(x) = a g(x)$ for some constant a.
The proof of Equation 4.4-8 is obtained by considering the norm of $\lambda h(x) + g(x)$ as a function
of the variable $\lambda$,

$$\|\lambda h(x) + g(x)\|^2 = |\lambda|^2 \|h\|^2 + \lambda (h, g) + \lambda^* (h, g)^* + \|g\|^2 \geq 0. \tag{4.4-9}$$

If we let

$$\lambda = -\frac{(h, g)^*}{\|h\|^2}, \tag{4.4-10}$$

Equation 4.4-8 follows. In the special case where h and g are real functions of a real RV, that
is, h(X), g(X), Equation 4.4-8 still is valid provided that the definitions of norm and inner
product are modified as follows:

$$\|h\|^2 \triangleq \int_{-\infty}^{\infty} h^2(x) f_X(x)\,dx = E[h^2(X)] \tag{4.4-11}$$

$$(h, g) \triangleq \int_{-\infty}^{\infty} h(x) g(x) f_X(x)\,dx = E[h(X) g(X)], \tag{4.4-12}$$

whence we obtain

$$|E[h(X) g(X)]| \leq (E[h^2(X)])^{1/2} (E[g^2(X)])^{1/2}. \tag{4.4-13}$$


Law of large numbers. A very important application of Chebyshev's inequality is to
prove the so-called weak Law of Large Numbers (LLN) that gives conditions under which a
sample mean converges to the ensemble mean.

Example 4.4-3
(weak law of large numbers) Let $X_1, \ldots, X_n$ be i.i.d. RVs with mean $\mu_X$ and variance $\sigma_X^2$.
Assume that we don't know the value of $\mu_X$ (or $\sigma_X$) and thus consider the sample mean
estimator†

$$\hat{\mu}_n \triangleq \frac{1}{n} \sum_{i=1}^{n} X_i$$

as an estimator for $\mu_X$. We can use the Chebyshev inequality to show that $\hat{\mu}_n$ is
asymptotically a perfect estimator of $\mu_X$. First we compute

$$E[\hat{\mu}_n] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n \mu_X = \mu_X.$$

Next, using the independence of the $X_i$, we compute

$$\mathrm{Var}[\hat{\mu}_n] = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}[X_i] = \frac{1}{n^2}\, n \sigma_X^2 = \frac{1}{n} \sigma_X^2.$$

Thus, by the Chebyshev inequality (Equation 4.4-1) we have

$$P[|\hat{\mu}_n - \mu_X| \geq \delta] \leq \sigma_X^2 / n\delta^2.$$

Clearly for any fixed $\delta > 0$, the right side can be made arbitrarily small by choosing n large
enough. Thus,

$$\lim_{n \to \infty} P[|\hat{\mu}_n - \mu_X| \geq \delta] = 0$$

for every $\delta > 0$. Note though that for $\delta$ small, we may need n quite large to guarantee that
the probability of the event $\{|\hat{\mu}_n - \mu_X| \geq \delta\}$ is sufficiently small. This type of convergence
is called convergence in probability and is treated more extensively in Chapter 8.

The LLN is the theoretical basis for estimating $\mu_X$ from measurements. When an
experimenter takes the sample mean of n measurements, he is relying on the LLN in order to
use the sample mean as an estimate of the unknown mathematical expectation (ensemble
average) $E[X] = \mu_X$.
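A small simulation illustrates how the deviation probability shrinks with n and how loose the Chebyshev bound can be. This is a sketch under assumptions that are not in the example: the $X_i$ are taken uniform on (0, 1), so that the mean is 1/2 and the variance is 1/12, and the seed, trial count, and $\delta = 0.05$ are arbitrary choices.

```python
import random

def deviation_frequency(n, delta, trials, rng):
    """Fraction of trials in which the sample mean of n uniforms misses 0.5 by >= delta."""
    count = 0
    for _ in range(trials):
        sample_mean = sum(rng.random() for _ in range(n)) / n
        if abs(sample_mean - 0.5) >= delta:
            count += 1
    return count / trials

rng = random.Random(1)
var_x = 1.0 / 12.0   # variance of a uniform RV on (0, 1)
delta = 0.05

results = []
for n in (10, 100, 1000):
    freq = deviation_frequency(n, delta, trials=1000, rng=rng)
    bound = min(1.0, var_x / (n * delta ** 2))   # Chebyshev bound, capped at 1
    results.append((n, freq, bound))
    print(f"n={n:5d}  observed frequency={freq:.4f}  Chebyshev bound={bound:.4f}")
```

The observed frequencies fall toward zero as n grows and stay below the bound, consistent with convergence in probability.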











†An estimator is a function of the observations $X_1, X_2, \ldots, X_n$ that estimates a parameter of the
distribution. Estimators are random variables. When an estimator takes on a particular value, that is, a
realization, that number is sometimes called the estimate. Estimators are discussed in Chapter 6.

Sometimes inequalities can be derived from properties of the pdf. We illustrate with
the following example due to Yongyi Yang.


Example 4.4-4
(symmetric RVs) Let the pdf of the real RV X satisfy $f_X(x) = f_X(-x)$; that is, X is
symmetrically distributed around zero. Show that $\sigma_X \geq E[|X|]$ with equality if $\mathrm{Var}(|X|) = 0$.


Solution Let $Y \triangleq |X|$. Then $E[Y^2] = E[X^2] = \mu_X^2 + \sigma_X^2 = \sigma_X^2$ since $\mu_X = 0$. Also
$E[Y^2] = \mu_Y^2 + \sigma_Y^2 = E^2[|X|] + \sigma_Y^2 = \sigma_X^2$. But $\sigma_Y^2 \geq 0$. Hence $E^2[|X|] \leq \sigma_X^2$ with equality if
$\sigma_Y^2 = 0$. Such a case arises when the pdf of X has the form $f_X(x) = \frac{1}{2}[\delta(x - a) + \delta(x + a)]$,
where a is some positive number. Then $Y = a$, $\sigma_Y = 0$, and $E[|X|] = \sigma_X$.


Another inequality is furnished by the Chernoff bound. We discuss this bound in Section
4.6 after introducing the moment-generating function M(t) in the next section.


4.5 MOMENT-GENERATING FUNCTIONS 


The moment-generating function (MGF), if it exists, of an RV X is defined by†

$$M(t) \triangleq E[e^{tX}] \tag{4.5-1}$$

$$= \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx, \tag{4.5-2}$$

where t is a complex variable.
For discrete RVs, we can define M(t) using the PMF as

$$M(t) = \sum_i e^{t x_i} P_X(x_i). \tag{4.5-3}$$

From Equation 4.5-2 we see that except for a sign reversal in the exponent, the MGF is the
two-sided Laplace transform of the pdf, for which there is a known inversion formula. Thus,
in general, knowing M(t) is equivalent to knowing $f_X(x)$ and vice versa.

The main reasons for introducing M(t) are (1) it enables a convenient computation of
the moments of X; (2) it can be used to estimate $f_X(x)$ from experimental measurements
of the moments; (3) it can be used to solve problems involving the computation of the sums
of RVs; and (4) it is an important analytical instrument that can be used to demonstrate
basic results such as the Central Limit Theorem.‡

†The terminology varies (see Feller [4-1], p. 411).
‡To be discussed in Section 4.7.

Proceeding formally, if we expand $e^{tX}$ and take expectations, then

$$E[e^{tX}] = E\left[ 1 + tX + \frac{(tX)^2}{2!} + \cdots + \frac{(tX)^n}{n!} + \cdots \right] = 1 + t m_1 + \frac{t^2}{2!} m_2 + \cdots + \frac{t^n}{n!} m_n + \cdots. \tag{4.5-4}$$

Since the moments $m_i$ may not exist, for example, none of the moments above the first
exist for the Cauchy pdf, M(t) may not exist. However, if M(t) does exist, any
moment is easily computed by differentiation. Indeed, if we allow the notation

$$M^{(k)}(0) \triangleq \left. \frac{d^k}{dt^k} M(t) \right|_{t=0},$$

then

$$m_k = M^{(k)}(0), \qquad k = 0, 1, \ldots. \tag{4.5-5}$$


Example 4.5-1 
(Gaussian MGF) Let $X : N(\mu, \sigma^2)$. Its MGF is then given as

$$M_X(t) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \exp\left( -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right) e^{tx}\,dx. \tag{4.5-6}$$

Using the procedure known as “completing the square”† in the exponent, we can write
Equation 4.5-6 as

$$M_X(t) = \exp[\mu t + \sigma^2 t^2/2] \times \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \exp\left( -\frac{1}{2\sigma^2} \left( x - (\mu + \sigma^2 t) \right)^2 \right) dx.$$

But the second factor is unity since it is the integral of a Gaussian pdf with
mean $\mu + \sigma^2 t$ and variance $\sigma^2$. Hence the Gaussian MGF is

$$M_X(t) = \exp(\mu t + \sigma^2 t^2/2), \tag{4.5-7}$$

from which we obtain

$$M_X^{(1)}(0) = \mu$$

$$M_X^{(2)}(0) = \mu^2 + \sigma^2.$$

†See “Completing the square” in Appendix A.

Example 4.5-2
(MGF of binomial) Let B be a binomial RV with parameters n (number of tries), p
(probability of a success per trial), and q = 1 - p. Then the MGF is given as

$$M_B(t) = \sum_{k=0}^{n} e^{tk} \binom{n}{k} p^k q^{n-k} = \sum_{k=0}^{n} \binom{n}{k} (p e^t)^k q^{n-k} = (p e^t + q)^n. \tag{4.5-8}$$

We obtain

$$M_B^{(1)}(0) = np \triangleq \mu$$

$$M_B^{(2)}(0) = \left\{ n p e^t (p e^t + q)^{n-1} + n(n-1) p^2 e^{2t} (p e^t + q)^{n-2} \right\}_{t=0} \tag{4.5-9}$$

$$= npq + \mu^2.$$

Hence

$$\mathrm{Var}[B] = npq. \tag{4.5-10}$$

Example 4.5-3 
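(MGF of geometric distribution) Let X follow the geometric distribution. Then the PMF
is $P_X(n) = a^n (1 - a) u(n)$, $n = 0, 1, 2, \ldots$, and $0 < a < 1$. The MGF is computed as

$$M_X(t) = \sum_{n=0}^{\infty} (1 - a) a^n e^{tn} = (1 - a) \sum_{n=0}^{\infty} (a e^t)^n = \frac{1 - a}{1 - a e^t},$$

where the geometric series converges for $a e^t < 1$.

Then the mean $\mu$ is computed from $\mu = M_X^{(1)}(0) = (1 - a)(1 - a e^t)^{-2} a e^t \big|_{t=0} = a/(1 - a)$.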








We make the observation that if all the moments exist and are known, then M(t) is
known as well (see Equations 4.5-4 and 4.5-2). Since $M_X(t)$ is related to $f_X(x)$ through the
Laplace transform, we can, in principle at least, determine $f_X(x)$ from its moments if they
exist.† In practice, if X is the RV whose pdf is desired and $X_i$ represents our ith observation
of X, then we can estimate the rth moment of X, $m_r$, from

$$\hat{m}_r = \frac{1}{n} \sum_{i=1}^{n} X_i^r, \tag{4.5-11}$$

where $\hat{m}_r$ is called the r-moment estimator and is an RV, and n is the number of
observations. Even though $\hat{m}_r$ is an RV, its variance becomes small as n becomes large. So for
n large enough, we can have confidence that $\hat{m}_r$ is reasonably close to $m_r$ (a deterministic
quantity, that is, not an RV).

†For some distributions not all moments exist. For example, as stated earlier for the Cauchy distribution,
all moments above the first do not exist.

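The joint MGF $M_{XY}(t_1, t_2)$ of two RVs X and Y is defined by

$$M_{XY}(t_1, t_2) \triangleq E[e^{(t_1 X + t_2 Y)}] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp(t_1 x + t_2 y) f_{XY}(x,y)\,dx\,dy. \tag{4.5-12}$$

Proceeding as we did in Equation 4.5-4, we can establish with the help of a power series
expansion that

$$M_{XY}(t_1, t_2) = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \frac{t_1^i t_2^j}{i!\,j!}\, m_{ij}, \tag{4.5-13}$$

where $m_{ij}$ is defined in Equation 4.3-13. Using the notation

$$M_{XY}^{(l,n)}(0,0) \triangleq \left. \frac{\partial^{l+n} M_{XY}(t_1, t_2)}{\partial t_1^l\, \partial t_2^n} \right|_{t_1 = t_2 = 0},$$

we can show from Equation 4.5-12 or 4.5-13 that

$$m_{ln} = M_{XY}^{(l,n)}(0,0). \tag{4.5-14}$$

In particular

$$M_{XY}^{(1,0)}(0,0) = \mu_X, \qquad M_{XY}^{(0,1)}(0,0) = \mu_Y \tag{4.5-15}$$

$$M_{XY}^{(2,0)}(0,0) = E[X^2], \qquad M_{XY}^{(0,2)}(0,0) = E[Y^2] \tag{4.5-16}$$

$$M_{XY}^{(1,1)}(0,0) = m_{11} = \mathrm{Cov}[X,Y] + \mu_X \mu_Y. \tag{4.5-17}$$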


4.6 CHERNOFF BOUND 
The Chernoff bound furnishes an upper bound on the tail probability $P[X \geq a]$, where a is
some prescribed constant. First note that $u(x - a) \leq e^{t(x-a)}$ for any $t \geq 0$. Assume that X
is a continuous RV. Then

$$P[X \geq a] = \int_a^{\infty} f_X(x)\,dx = \int_{-\infty}^{\infty} f_X(x) u(x - a)\,dx \tag{4.6-1}$$

and, by the observation made above, it follows that

$$P[X \geq a] \leq \int_{-\infty}^{\infty} f_X(x) e^{t(x-a)}\,dx \tag{4.6-2}$$

and this must hold for any $t \geq 0$. But, from Equation 4.5-2,

$$\int_{-\infty}^{\infty} f_X(x) e^{t(x-a)}\,dx = e^{-at} M_X(t), \tag{4.6-3}$$

where the subscript has been added to emphasize that the MGF is associated with X.
Combining Equations 4.6-3 and 4.6-2, we obtain

$$P[X \geq a] \leq e^{-at} M_X(t). \tag{4.6-4}$$

The tightest bound, which occurs when the right-hand side is minimized with respect to t,
is called the Chernoff bound. We illustrate with some examples.


Example 4.6-1 
(Chernoff bound to Gaussian) Let $X : N(\mu, \sigma^2)$ and consider the Chernoff bound on
$P[X \geq a]$, where $a > \mu$. From Equations 4.5-7 and 4.6-3 we obtain

$$P[X \geq a] \leq e^{-(a-\mu)t + \sigma^2 t^2/2}.$$

The minimum of the right-hand side is obtained by differentiating with respect to t and
occurs when $t = (a - \mu)/\sigma^2$. Hence the Chernoff bound is

$$P[X \geq a] \leq e^{-(a-\mu)^2/2\sigma^2}. \tag{4.6-5}$$
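To see how the bound compares with the exact Gaussian tail, the following sketch evaluates both and also confirms, by a coarse search over t, that the closed-form minimizer gives the smallest value of the exponential bound. The parameter values, the search grid, and the function names are assumptions made for illustration only.

```python
import math

def gaussian_tail(a, mu, sigma):
    """Exact P[X >= a] for a Normal RV, via the complementary error function."""
    return 0.5 * math.erfc((a - mu) / (sigma * math.sqrt(2.0)))

def exponential_bound(t, a, mu, sigma):
    """The bound exp(-a t) M_X(t) for a Normal RV, before minimizing over t."""
    return math.exp(-(a - mu) * t + 0.5 * sigma ** 2 * t ** 2)

def chernoff_bound(a, mu, sigma):
    """Minimized bound exp(-(a - mu)^2 / (2 sigma^2)), valid for a > mu."""
    return math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2))

mu, sigma = 1.0, 2.0
checks = []
for a in (2.0, 4.0, 7.0):
    # Coarse grid search over t in (0, 5] as a sanity check on the minimizer.
    grid_min = min(exponential_bound(0.01 * i, a, mu, sigma) for i in range(1, 501))
    closed = chernoff_bound(a, mu, sigma)
    tail = gaussian_tail(a, mu, sigma)
    checks.append((a, tail, closed, grid_min))
    print(f"a={a}: exact tail={tail:.5f}  Chernoff={closed:.5f}  grid minimum={grid_min:.5f}")
```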


The Chernoff bound can be derived for discrete RVs also. For example, assume that
an RV X takes values $X = i$, $i = 0, 1, 2, \ldots$, with probabilities $P[X = i] \triangleq P_X(i)$. For any
integers n, k, define

$$u(n - k) \triangleq \begin{cases} 1, & n \geq k, \\ 0, & \text{otherwise.} \end{cases}$$

It follows, therefore, that

$$P[X \geq k] = \sum_{n=k}^{\infty} P_X(n) = \sum_{n=0}^{\infty} P_X(n) u(n - k) \leq \sum_{n=0}^{\infty} P_X(n) e^{t(n-k)} \quad \text{for } t \geq 0.$$

The last line follows from the fact that

$$e^{t(n-k)} \geq u(n - k) \quad \text{for } t \geq 0.$$

We note that

$$\sum_{n=0}^{\infty} P_X(n) e^{t(n-k)} = e^{-tk} \sum_{n=0}^{\infty} P_X(n) e^{tn} = e^{-tk} M_X(t) \quad \text{(by Equation 4.5-3).}$$

Hence we establish the result

$$P[X \geq k] \leq e^{-tk} M_X(t). \tag{4.6-6}$$

As before, the Chernoff bound is determined by minimizing the right-hand side of Equation
4.6-6. We illustrate with an example.


Example 4.6-2
(Chernoff bound for Poisson) Let X be a Poisson RV with parameter a > 0. Compute the
Chernoff bound for $P[X \geq k]$, where k > a. From homework problem 4.39 we find the MGF

$$M_X(t) = e^{a[e^t - 1]}$$

and

$$e^{-tk} M_X(t) = e^{-a} e^{[a e^t - kt]}.$$

By setting

$$\frac{d}{dt}\left[ e^{-tk} M_X(t) \right] = 0,$$

we find that the minimum is reached when $t = t_m$, where

$$t_m = \ln \frac{k}{a}.$$

Thus with a = 2 and k = 5, we find

$$P[X \geq 5] \leq e^{-2} \exp[5 - 5\ln(5/2)] \approx 0.21.$$
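For reference, the numerical value of the bound and the exact Poisson tail can be computed directly. The sketch below evaluates the minimized bound at $t_m = \ln(k/a)$ and sums the PMF for the exact tail; the function names are illustrative.

```python
import math

def poisson_tail(k, a):
    """Exact P[X >= k] for a Poisson RV with parameter a."""
    below = sum(math.exp(-a) * a ** n / math.factorial(n) for n in range(k))
    return 1.0 - below

def chernoff_poisson(k, a):
    """exp(-t k) M_X(t) evaluated at the minimizing t_m = ln(k / a), for k > a."""
    t_m = math.log(k / a)
    return math.exp(-t_m * k + a * (math.exp(t_m) - 1.0))

a, k = 2.0, 5
bound = chernoff_poisson(k, a)
tail = poisson_tail(k, a)
print(f"Chernoff bound: {bound:.4f}")   # about 0.2057
print(f"exact tail:     {tail:.4f}")    # about 0.0527
```

The bound is valid but about four times the exact tail probability for these parameter values.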


4.7 CHARACTERISTIC FUNCTIONS 


If in Equation 4.5-1 we replace the parameter t by $j\omega$, where $j \triangleq \sqrt{-1}$, we obtain the
characteristic function (CF) of X defined by

$$\Phi_X(\omega) \triangleq E[e^{j\omega X}] = \int_{-\infty}^{\infty} f_X(x) e^{j\omega x}\,dx, \tag{4.7-1}$$

which, except for a minus sign difference in the exponent, we recognize as the Fourier
transform of $f_X(x)$. For discrete RVs we can define $\Phi_X(\omega)$ in terms of the PMF by

$$\Phi_X(\omega) = \sum_i e^{j\omega x_i} P_X(x_i). \tag{4.7-2}$$

For our purposes, the CF has all the properties of the MGF. The Fourier transform is widely
used in statistical communication theory, and since the inversion of Equation 4.7-1 is often
easy to achieve, either by direct integration or through the availability of extensive tables
of Fourier transforms (e.g., [4-7]), the CF is widely used to solve problems involving the
computation of the sums of independent RVs. We have seen that the pdf of the sum of
independent RVs involves the convolution of their pdfs. Thus if $Z = X_1 + \cdots + X_N$, where
$X_i$, $i = 1, \ldots, N$, are independent RVs, the pdf of Z is furnished by

$$f_Z(z) = f_{X_1}(z) * f_{X_2}(z) * \cdots * f_{X_N}(z), \tag{4.7-3}$$

that is, the repeated convolution product.
The actual evaluation of Equation 4.7-3 can be tedious. However, we know from our
studies of Fourier transforms that the Fourier transform of a convolution product is the
product of the individual transforms. We illustrate the use of CFs in the following examples.


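Example 4.7-1
(CF of sum) Let $Z \triangleq X_1 + X_2$ with $f_{X_1}(x)$, $f_{X_2}(x)$, and $f_Z(z)$ denoting the pdfs of $X_1$,
$X_2$, and Z, respectively. Show that $\Phi_Z(\omega) = \Phi_{X_1}(\omega)\Phi_{X_2}(\omega)$.


Solution From the main result of Section 3.3 (Equation 3.3-15), we have

$$f_Z(z) = \int_{-\infty}^{\infty} f_{X_1}(x) f_{X_2}(z - x)\,dx$$

and the corresponding CF

$$\Phi_Z(\omega) = \int_{-\infty}^{\infty} e^{j\omega z} \int_{-\infty}^{\infty} f_{X_1}(x) f_{X_2}(z - x)\,dx\,dz = \int_{-\infty}^{\infty} f_{X_1}(x) \int_{-\infty}^{\infty} f_{X_2}(z - x) e^{j\omega z}\,dz\,dx.$$

With a change of variable $\alpha \triangleq z - x$, we obtain the CF of the sum Z as

$$\Phi_Z(\omega) = \Phi_{X_1}(\omega)\Phi_{X_2}(\omega).$$

This result can be extended to N RVs by induction. Thus if $Z = X_1 + \cdots + X_N$, then the
CF of Z would be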


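$$\Phi_Z(\omega) = \Phi_{X_1}(\omega)\Phi_{X_2}(\omega) \cdots \Phi_{X_N}(\omega).$$


Example 4.7-2
(CF of i.i.d. sum) Let $X_i$, $i = 1, \ldots, N$, be a sequence of i.i.d. RVs with $X_i : N(0, 1)$. Compute
the pdf of

$$Z \triangleq \sum_{i=1}^{N} X_i.$$

Solution The pdf of Z can be computed by Equation 4.7-3. On the other hand, with
$\Phi_{X_i}(\omega)$ denoting the CF of $X_i$, we have

$$\Phi_Z(\omega) = \Phi_{X_1}(\omega) \times \cdots \times \Phi_{X_N}(\omega). \tag{4.7-4}$$

However, since the $X_i$'s are i.i.d. N(0, 1), the CFs of all the $X_i$'s are the same, and we define
$\Phi_X(\omega) \triangleq \Phi_{X_1}(\omega) = \cdots = \Phi_{X_N}(\omega)$. Thus,

$$\Phi_X(\omega) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} e^{j\omega x}\,dx. \tag{4.7-5}$$

By completing the squares in the exponent, we obtain

$$\Phi_X(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}[x^2 - 2j\omega x + (j\omega)^2]}\, e^{\frac{1}{2}(j\omega)^2}\,dx = e^{-\omega^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - j\omega)^2}\,dx.$$

But the integral can be regarded as the area under a “Gaussian pdf” with “mean” $j\omega$ and
hence its value is unity.† Thus we obtain the CF of X as

$$\Phi_X(\omega) = e^{-\omega^2/2}$$

and so the CF of Z is

$$\Phi_Z(\omega) = [\Phi_X(\omega)]^N = e^{-N\omega^2/2}. \tag{4.7-6}$$

From the form of $\Phi_Z(\omega)$ we deduce that $f_Z(z)$ must also be Gaussian. To obtain $f_Z(z)$ we
use the Fourier inversion formula:

$$f_Z(z) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Phi_Z(\omega) e^{-j\omega z}\,d\omega. \tag{4.7-7}$$

Inserting Equation 4.7-6 into Equation 4.7-7 and manipulating terms enables us to obtain

$$f_Z(z) = \frac{1}{\sqrt{2\pi N}}\, e^{-z^2/2N}.$$

Hence $f_Z(z)$ is indeed Gaussian. The variance of Z is N, and its mean is zero.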


†While this result is not obvious, it can be rigorously demonstrated using integration in the complex
plane.

Example 4.7-3
(CF of sum of uniform RVs) Consider two independent RVs X and Y with common pdf

$$f_X(x) = f_Y(x) = \frac{1}{a}\,\mathrm{rect}\left( \frac{x}{a} \right).$$

Compute the pdf of $Z \triangleq X + Y$ using CFs.


Solution We can, of course, compute $f_Z(z)$ by convolving $f_X(x)$ and $f_Y(y)$. However,
using CFs, we obtain $f_Z(z)$ from

$$f_Z(z) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Phi_X(\omega)\Phi_Y(\omega) e^{-j\omega z}\,d\omega,$$

where

$$\Phi_X(\omega)\Phi_Y(\omega) = \Phi_Z(\omega).$$

Since the pdfs of X and Y are the same, we can write

$$\Phi(\omega) \triangleq \Phi_X(\omega) = \Phi_Y(\omega) = \frac{1}{a} \int_{-a/2}^{a/2} e^{j\omega x}\,dx = \frac{\sin(a\omega/2)}{a\omega/2}.$$

Hence

$$\Phi_Z(\omega) = \left( \frac{\sin(a\omega/2)}{a\omega/2} \right)^2 \tag{4.7-8}$$

and

$$f_Z(z) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Phi_Z(\omega) e^{-j\omega z}\,d\omega = \frac{1}{a} \left( 1 - \frac{|z|}{a} \right) \mathrm{rect}\left( \frac{z}{2a} \right), \tag{4.7-9}$$

which is shown in Figure 4.7-1. The easiest way to obtain the result in Equation 4.7-9 is
to look up the Fourier transform (or its inverse) of Equation 4.7-8 in a table of elementary
Fourier transforms.


As in the case of MGFs, we can compute the moments from the CFs by differentiation,
provided that these exist. If we expand $\exp(j\omega X)$ into a power series and take the
expectation, we obtain

$$\Phi_X(\omega) = E[e^{j\omega X}] = \sum_{n=0}^{\infty} \frac{(j\omega)^n}{n!}\, m_n. \tag{4.7-10}$$

Figure 4.7-1 The pdf of Z = X + Y when X and Y are independently, identically, and uniformly
distributed in (-a/2, a/2). (Horizontal axis z, marked at -a, 0, and a; vertical axis $f_Z(z)$.)

From Equation 4.7-10 it is easily established that

$$m_n = \frac{1}{j^n}\, \Phi_X^{(n)}(0), \tag{4.7-11}$$

where we have used the notation

$$\Phi_X^{(n)}(0) \triangleq \left. \frac{d^n}{d\omega^n} \Phi_X(\omega) \right|_{\omega = 0}.$$
Example 4.7-4 
(moment calculation) Compute the first few moments of $Y = \sin\Theta$ if $\Theta : U[0, 2\pi]$.


Solution We use the result in Equation 4.1-9; that is, if Y = g(X), then

$$\mu_Y = \int_{-\infty}^{\infty} y f_Y(y)\,dy = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$

Hence

$$E[e^{j\omega Y}] = \int_{-\infty}^{\infty} e^{j\omega y} f_Y(y)\,dy = \frac{1}{2\pi} \int_0^{2\pi} e^{j\omega \sin\theta}\,d\theta = J_0(\omega),$$

where $J_0(\omega)$ is the Bessel function of the first kind of order zero. A power series expansion
of $J_0(\omega)$ gives

$$J_0(\omega) = 1 - \left( \frac{\omega}{2} \right)^2 + \frac{1}{4} \left( \frac{\omega}{2} \right)^4 - \cdots.$$

Hence all the odd-order moments are zero. From Equation 4.7-11 we compute

$$E[Y^2] = m_2 = (-1)\Phi_Y^{(2)}(0) = \tfrac{1}{2}$$

$$E[Y^4] = m_4 = (+1)\Phi_Y^{(4)}(0) = \tfrac{3}{8}.$$

Example 4.7-5 
(sum of independent binomials) Let X and Y be i.i.d. binomial RVs with parameters n and
p, that is,

$$P_X(k) = P_Y(k) = \binom{n}{k} p^k q^{n-k}.$$

Compute the PMF of Z = X + Y.


Solution Since X and Y take on nonnegative integer values, so must Z. We can solve
this problem by (1) convolution of the pdfs, which involves delta functions; (2) discrete
convolution of the PMFs; and (3) CFs. The discrete convolution for this case is

$$P_Z(k) = \sum_{i=0}^{k} P_X(i) P_Y(k - i) = p^k q^{2n-k} \sum_{i=0}^{k} \binom{n}{i} \binom{n}{k-i} \quad \text{for } k = 0, 1, \ldots, 2n.$$

The trouble is that we may not immediately recognize the closed form of the sum of products
of binomial coefficients.† The computation of the PMF of Z by CFs is very simple. First
observe that

$$\Phi_X(\omega) = \Phi_Y(\omega) = \sum_{k=0}^{n} e^{j\omega k} \binom{n}{k} p^k q^{n-k} = (p e^{j\omega} + q)^n.$$

Thus, by virtue of the independence of X and Y, we obtain the CF

$$\Phi_Z(\omega) = E[\exp j\omega(X + Y)] = E[\exp(j\omega X)]\,E[\exp(j\omega Y)] = (p e^{j\omega} + q)^{2n}.$$

Thus Z is binomial with parameters 2n and p, that is,

$$P_Z(k) = \binom{2n}{k} p^k q^{2n-k} \quad \text{for } k = 0, \ldots, 2n.$$

†Recall that we ran into this problem in Example 3.3-9 in Chapter 3.

As a by-product of the computation of $P_Z(k)$ by CFs, we obtain the result that

$$\binom{2n}{k} = \sum_{i=0}^{k} \binom{n}{i} \binom{n}{k-i}.$$

An extension of this result is the following: If $X_1, X_2, \ldots, X_N$ are i.i.d. binomials with
parameters n, p, then $Z = \sum_{i=1}^{N} X_i$ is binomial with parameters Nn, p. Regardless of how
large N gets, Z remains a discrete RV with a binomial PMF.‡
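Both the PMF result and the binomial-coefficient identity can be verified exactly for small parameter values. The following sketch carries out the discrete convolution with rational arithmetic; the choice n = 4, p = 3/10 and the helper names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """List of P[X = k], k = 0..n, for a binomial RV with parameters n and p."""
    q = 1 - p
    return [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]

def convolve(pa, pb):
    """Discrete convolution of two PMFs supported on 0, 1, 2, ..."""
    out = [Fraction(0)] * (len(pa) + len(pb) - 1)
    for i, a in enumerate(pa):
        for j, b in enumerate(pb):
            out[i + j] += a * b
    return out

n, p = 4, Fraction(3, 10)
pz_by_convolution = convolve(binomial_pmf(n, p), binomial_pmf(n, p))
pz_closed_form = binomial_pmf(2 * n, p)

# The by-product identity; math.comb returns 0 when the lower index exceeds the upper.
identity_holds = all(
    comb(2 * n, k) == sum(comb(n, i) * comb(n, k - i) for i in range(k + 1))
    for k in range(2 * n + 1)
)

print("PMFs agree:", pz_by_convolution == pz_closed_form)
print("identity holds:", identity_holds)
```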


Example 4.7-6 ~~ 
(variance of Poisson) Here we calculate the CF of a Poisson RV and use it to determine 
the variance. Let the RV K be Poisson distributed with PMF 





aF 
Pxr(k) = pe ul), a>0. 
Then the CF is given as 
oO k 
®x(w) = oe teivk 
k! 
k=0 
5 (ac™)" a 
= e 
Fear k 


= exp [a (e” — 1)] . 
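Now $m_2 = E[K^2] = (-j)^2\,\Phi_K^{(2)}(0) = -\Phi_K^{(2)}(0)$. Taking the indicated derivatives, we get

$$\Phi_K^{(1)}(\omega) = \Phi_K(\omega)\,aj\,e^{j\omega}$$

and
$$\begin{aligned}
\Phi_K^{(2)}(\omega) &= \Phi_K(\omega)\,aj^2 e^{j\omega} + \Phi_K^{(1)}(\omega)\,aj\,e^{j\omega} \\
&= -\Phi_K(\omega)\,ae^{j\omega} + \Phi_K(\omega)\left(aje^{j\omega}\right)^2 .
\end{aligned}$$
So $\Phi_K^{(2)}(0) = -1 \times a - 1 \times a^2$. Hence $m_2 = a + a^2$. Then since the mean is $\mu_K = a$, the
variance must be

$$\sigma_K^2 = m_2 - \mu_K^2 = a.$$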


The variance of the Poisson RV thus equals its mean value. 
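
As a numerical cross-check of the differentiation above, the sketch below approximates the first two derivatives of the Poisson CF at the origin by central differences and recovers the mean and variance from them. The mean a = 3.5, the step size h, and the function names are illustrative assumptions, not part of the text.

```python
import cmath

def poisson_cf(w, a):
    """Characteristic function exp[a(e^{jw} - 1)] of a Poisson RV with mean a."""
    return cmath.exp(a * (cmath.exp(1j * w) - 1.0))

def moments_from_cf(cf, h=1e-3):
    """Estimate m1 and m2 from central differences of the CF at w = 0."""
    d1 = (cf(h) - cf(-h)) / (2 * h)             # approximates the first derivative at 0
    d2 = (cf(h) - 2 * cf(0.0) + cf(-h)) / h**2  # approximates the second derivative at 0
    m1 = ((-1j) * d1).real        # m1 = (-j)   times the first derivative
    m2 = ((-1j) ** 2 * d2).real   # m2 = (-j)^2 times the second derivative
    return m1, m2

a = 3.5  # illustrative mean, not from the text
m1, m2 = moments_from_cf(lambda w: poisson_cf(w, a))
variance = m2 - m1**2
print(m1, m2, variance)  # both m1 and variance should be close to a
```

Finite differences are used here only because they mirror Equation 4.7-11 directly; summing the PMF would serve equally well.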





†Recall this statement for future reference in connection with the Central Limit Theorem.

Note that since the variance of the Poisson RV equals its mean, the standard deviation
is the square root of the mean. Therefore for large mean values, the distribution becomes
relatively concentrated around the mean. Another point is that, unlike the Normal distribution,
the mean and variance of the Poisson RV are not independent parameters, i.e., they
cannot be freely chosen.


Example 4.7-7
(a fair game?) A lottery game called “three players for six hits” is played as follows. A bettor 
bets the bank that three baseball players of the bettor’s choosing will get a combined total 
of six hits or more in the games in which they play. Many combinations can lead to a win; 
for example, player A can go hitless in his game, but player B can collect three hits in his 
game, and player C can collect three hits in his game. The players can be on the same team 
or on different teams. The bet is at even odds and the bettor receives back $2 on a bet of $1 
in case of a win. Is this a “fair” game, that is, is the probability of a win close to one-half? 


Solution Let $X_1, X_2, X_3$ denote the number of hits by players A, B, C, respectively.
Clearly $X_1, X_2, X_3$ are individually binomial. The total number of hits is $Y = \sum_{i=1}^{3} X_i$.
We wish to compute $P[Y \geq 6]$. To simplify the problem, assume that each player bats five
times per game, and their batting averages are the same, say .300 (for those unfamiliar with
baseball nomenclature, this means that the probability of getting a hit while batting is 0.3).
Then from the results of Example 4.7-5, we find Y is binomial with parameters n = 15,
p = 0.3. Thus,

$$\begin{aligned}
P[Y \geq 6] &= \sum_{i=6}^{15} \binom{15}{i} (0.3)^i (0.7)^{15-i} \\
&\approx \operatorname{erf}(6.76) - \operatorname{erf}(0.56) \\
&= 0.29.
\end{aligned}$$


In arriving at this result, we used the Normal approximation to the binomial as suggested 
in Chapter 1, Section 1.11. The bettor has less than a one-third chance of winning. Despite 
the poor odds, the game can be modified to be fairer to the bettor. Define the RV G as the 
gain to the bettor and define a fair game as one in which the expected gain is zero. Then 
if the bettor were to receive winnings of $2.45 per play instead of $1, we would find that
E[G] = $2.45 × 0.29 − $1 × 0.71 ≈ 0. Of course if E[G] > 0, then in a sense, the game favors
the bettor. Some people play the state lottery using this criterion. 
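
For comparison with the Normal approximation, the binomial sum can also be evaluated exactly. The short Python sketch below does so under the same assumptions as the example (15 at-bats in total, hit probability 0.3); the function name and the break-even calculation are our own illustration. The exact win probability comes out near 0.28, close to the approximate figure of 0.29, so the break-even winnings are slightly higher than the $2.45 quoted above.

```python
from math import comb

def binomial_tail(n, p, k_min):
    """Exact P[Y >= k_min] for Y binomial with parameters n, p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k_min, n + 1))

p_win = binomial_tail(15, 0.3, 6)   # exact value of the sum in the example
break_even = (1 - p_win) / p_win    # winnings per $1 bet that make E[G] = 0
print(round(p_win, 4), round(break_even, 2))
```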





Joint Characteristic Functions 


As in the case of joint MGF's we can define the joint CF by

$$\Phi_{X_1 \ldots X_N}(\omega_1, \omega_2, \ldots, \omega_N) = E\left[\exp\left(j \sum_{i=1}^{N} \omega_i X_i\right)\right]. \tag{4.7-12}$$

By the Fourier inversion property, the joint pdf is the inverse Fourier transform (with a sign
reversal) of $\Phi_{X_1 \ldots X_N}(\omega_1, \ldots, \omega_N)$. Thus,

$$f_{X_1 \ldots X_N}(x_1, \ldots, x_N) = \frac{1}{(2\pi)^N} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Phi_{X_1 \ldots X_N}(\omega_1, \ldots, \omega_N) \exp\left(-j \sum_{i=1}^{N} \omega_i x_i\right) d\omega_1\, d\omega_2 \ldots d\omega_N. \tag{4.7-13}$$


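We can obtain the moments by differentiation. For instance, with X, Y denoting any two
RVs (N = 2) we have

$$m_{rk} = E[X^r Y^k] = (-j)^{r+k}\, \Phi_{XY}^{(r,k)}(0, 0), \tag{4.7-14}$$

where

$$\Phi_{XY}^{(r,k)}(0, 0) \triangleq \left. \frac{\partial^{r+k} \Phi_{XY}(\omega_1, \omega_2)}{\partial \omega_1^r\, \partial \omega_2^k} \right|_{\omega_1 = \omega_2 = 0}. \tag{4.7-15}$$

Finally for discrete RVs we can define the joint CF in terms of the joint PMF. For instance
for two RVs, X and Y, we obtain

$$\Phi_{XY}(\omega_1, \omega_2) = \sum_i \sum_j e^{j(\omega_1 x_i + \omega_2 y_j)} P_{XY}(x_i, y_j). \tag{4.7-16}$$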


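Example 4.7-8
(joint CF of i.i.d. Normal RVs) Compute the joint characteristic function of X and Y if

$$f_{XY}(x, y) = \frac{1}{2\pi} \exp\left[-\frac{1}{2}\left(x^2 + y^2\right)\right].$$

Solution Applying the definition in Equation 4.7-12, we get

$$\Phi_{XY}(\omega_1, \omega_2) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 + y^2)}\, e^{j\omega_1 x + j\omega_2 y}\, dx\, dy.$$

Completing the squares in both x and y, we get

$$\begin{aligned}
\Phi_{XY}(\omega_1, \omega_2) &= e^{-\frac{1}{2}(\omega_1^2 + \omega_2^2)} \int_{-\infty}^{\infty} e^{-\frac{1}{2}[x^2 - 2j\omega_1 x + (j\omega_1)^2]}\, \frac{dx}{\sqrt{2\pi}} \\
&\quad \times \int_{-\infty}^{\infty} e^{-\frac{1}{2}[y^2 - 2j\omega_2 y + (j\omega_2)^2]}\, \frac{dy}{\sqrt{2\pi}} \\
&= e^{-\frac{1}{2}(\omega_1^2 + \omega_2^2)} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x - j\omega_1)^2}\, \frac{dx}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(y - j\omega_2)^2}\, \frac{dy}{\sqrt{2\pi}} \\
&= e^{-\frac{1}{2}(\omega_1^2 + \omega_2^2)},
\end{aligned}$$

since the integrals are the areas under unit-variance Gaussian curves.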
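Example 4.7-9
(joint CF of two discrete RVs) Compute the joint CF of the discrete RVs X and Y if the
joint PMF is

$$P_{XY}(k, l) = \begin{cases} \frac{1}{3}, & k = l = 0, \\ \frac{1}{6}, & k = \pm 1,\ l = 0, \\ \frac{1}{6}, & k = l = \pm 1, \\ 0, & \text{else.} \end{cases}$$

Solution Using Equation 4.7-16 we obtain

$$\begin{aligned}
\Phi_{XY}(\omega_1, \omega_2) &= \sum_{k=-1}^{1} \sum_{l=-1}^{1} e^{j(\omega_1 k + \omega_2 l)} P_{XY}(k, l) \\
&= \frac{1}{3} + \frac{1}{3}\cos\omega_1 + \frac{1}{3}\cos(\omega_1 + \omega_2).
\end{aligned}$$

From Equations 4.7-14 and 4.7-15 we obtain, since $\mu_X = \mu_Y = 0$,

$$\sigma_X^2 = m_{20} = -(-j)^2\, \frac{1}{3}\left[\cos\omega_1 + \cos(\omega_1 + \omega_2)\right] \Big|_{\omega_1 = \omega_2 = 0} = \frac{2}{3},$$

$$\sigma_Y^2 = m_{02} = -(-j)^2\, \frac{1}{3}\cos(\omega_1 + \omega_2) \Big|_{\omega_1 = \omega_2 = 0} = \frac{1}{3},$$

$$m_{11} = -(-j)^2\, \frac{1}{3}\cos(\omega_1 + \omega_2) \Big|_{\omega_1 = \omega_2 = 0} = \frac{1}{3}.$$

Hence the correlation coefficient ρ is computed to be

$$\rho = \frac{m_{11}}{\sigma_X \sigma_Y} = \frac{\frac{1}{3}}{\sqrt{\frac{2}{3}}\sqrt{\frac{1}{3}}} = \frac{1}{\sqrt{2}}.$$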
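
The moments in this example can also be obtained by summing directly over the joint PMF, which gives a quick check on the CF calculation. The sketch below assumes the PMF assigns probability 1/3 to the origin and 1/6 to each of the four points (±1, 0) and ±(1, 1), the reading that is consistent with the moments 2/3, 1/3, and 1/3 found above; the dictionary layout and function name are our own.

```python
from math import sqrt

# Joint PMF keyed by (k, l); all other points are assumed to have probability 0.
pmf = {
    (0, 0): 1 / 3,
    (1, 0): 1 / 6, (-1, 0): 1 / 6,
    (1, 1): 1 / 6, (-1, -1): 1 / 6,
}

def joint_moment(pmf, r, s):
    """E[X^r Y^s] computed directly from the joint PMF."""
    return sum(prob * k**r * l**s for (k, l), prob in pmf.items())

mean_x, mean_y = joint_moment(pmf, 1, 0), joint_moment(pmf, 0, 1)
m20, m02, m11 = joint_moment(pmf, 2, 0), joint_moment(pmf, 0, 2), joint_moment(pmf, 1, 1)
rho = m11 / sqrt(m20 * m02)  # valid in this form because both means are zero
print(mean_x, mean_y, m20, m02, m11, rho)
```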
Example 4.7-10 


(joint CF of correlated Normal RVs) As another example we compute the joint CF of X
and Y with

$$f_{XY}(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left(-\frac{x^2 + y^2 - 2\rho xy}{2(1 - \rho^2)}\right).$$
To solve this problem we use two facts:
(1) A zero-mean Gaussian RV Z with variance $\sigma_Z^2$ has CF

$$E[e^{j\omega Z}] = \exp\left[-\tfrac{1}{2}\sigma_Z^2 \omega^2\right] \tag{4.7-17}$$
and, in particular, with ω = 1,

$$E[e^{jZ}] = \exp\left[-\tfrac{1}{2}\sigma_Z^2\right]. \tag{4.7-18}$$

Proof of fact (1) Use the definition of the CF with $f_Z(z) = (2\pi\sigma_Z^2)^{-1/2} \exp\left(-\frac{1}{2}\frac{z^2}{\sigma_Z^2}\right)$
and apply the complete-the-square technique described in Appendix A.
(2) If X and Y are zero-mean jointly Gaussian RVs, then for any real $\omega_1, \omega_2$, the RVs

$$Z \triangleq \omega_1 X + \omega_2 Y$$
$$W \triangleq X$$
are jointly Gaussian and, as a direct by-product, the marginal density of Z is Gaussian.

Proof of fact (2) Simply use Equation 3.4-11 or 3.4-12 to compute $f_{ZW}(z, w)$. One
easily finds that Z, W are jointly Gaussian and that, therefore, the marginal pdf of Z alone
is Gaussian with E[Z] = 0. The variance of Z is computed as

$$\begin{aligned}
\mathrm{Var}(Z) &= E[(\omega_1 X + \omega_2 Y)^2] \\
&= \omega_1^2\, \mathrm{Var}[X] + \omega_2^2\, \mathrm{Var}[Y] + 2\omega_1\omega_2\, E[XY].
\end{aligned}$$

With $\sigma_X^2 = \sigma_Y^2 \triangleq 1$, we obtain $\sigma_Z^2 = \omega_1^2 + \omega_2^2 + 2\omega_1\omega_2\rho$.
Finally recalling that $Z = \omega_1 X + \omega_2 Y$ and using Equation 4.7-18, we write

$$E[e^{j(\omega_1 X + \omega_2 Y)}] = e^{-\frac{1}{2}(\omega_1^2 + \omega_2^2 + 2\omega_1\omega_2\rho)}. \tag{4.7-19}$$


Equation 4.7-19 is the joint CF of two zero-mean, unity variance correlated Gaussian RVs. 
When p = 0, the RVs become uncorrelated and therefore independent and we obtain the 
result in Example 4.7-8. 

The extension to more than two discrete RVs is straightforward, although the notation 
becomes a little clumsy, unless matrices are introduced. 





The Central Limit Theorem 


It is sometimes said that the sum of a large number of RVs tends toward the Normal. Under
what conditions is this true? The Central Limit Theorem deals with this important point.

Basically the Central Limit Theorem† says that the normalized sum of a large number
of mutually independent RVs $X_1, \ldots, X_n$ with zero means and finite variances $\sigma_1^2, \ldots, \sigma_n^2$
tends to the Normal CDF provided that the individual variances $\sigma_k^2$, k = 1,...,n, are
small compared to $\sum_{i=1}^{n} \sigma_i^2$. The constraints on the variances are known as the Lindeberg
conditions and are discussed in detail by Feller [4-1, p. 262]. We state a general form of the
Central Limit Theorem in the following and furnish a proof for a special case.


Theorem 4.7-1 Let $X_1, \ldots, X_n$ be n mutually independent (scalar) RVs with CDFs
$F_{X_1}(x_1), F_{X_2}(x_2), \ldots, F_{X_n}(x_n)$, respectively, such that

$$\mu_{X_k} = 0, \qquad \mathrm{Var}[X_k] = \sigma_k^2$$

and let
$$s_n^2 \triangleq \sigma_1^2 + \ldots + \sigma_n^2.$$

If for a given ε > 0 and n sufficiently large the $\sigma_k$ satisfy
$$\sigma_k < \varepsilon s_n, \qquad k = 1, \ldots, n,$$

then the normalized sum
$$Z_n \triangleq (X_1 + \ldots + X_n)/s_n$$

converges to the standard Normal CDF, denoted by 1/2 + erf(z), that is, $\lim_{n \to \infty} F_{Z_n}(z) =
1/2 + \operatorname{erf}(z)$. This is called convergence in distribution.


A discussion of convergence in distribution is given later in this section. 
We now prove a special case of the foregoing. 


Theorem 4.7-2 Let $X_1, X_2, \ldots, X_n$ be i.i.d. RVs with $\mu_{X_i} = 0$ and $\mathrm{Var}[X_i] = 1$,
i = 1,...,n. Then

$$Z_n \triangleq \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$$

tends to the Normal in the sense that its CF $\Phi_{Z_n}$ satisfies
$$\lim_{n \to \infty} \Phi_{Z_n}(\omega) = e^{-\omega^2/2},$$

which is the CF of the N(0,1) RV.


Proof Let $W_i \triangleq X_i/\sqrt{n}$. Also, let $\Phi_{X_i}(\omega)$ and $f_{X_i}(x)$ be the CF and pdf of $X_i$,
respectively. Then

$$\begin{aligned}
\Phi_{W_i}(\omega) &= E[e^{j\omega W_i}] \\
&= E[e^{j(\omega/\sqrt{n}) X_i}] \\
&= \Phi_{X_i}\left(\frac{\omega}{\sqrt{n}}\right).
\end{aligned}$$

†First proved by Abraham De Moivre in 1733 for the special case of Bernoulli RVs. A more general
proof was furnished by J. W. Lindeberg in Mathematische Zeitschrift, vol. 15 (1922), pp. 211-225.


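Since $\Phi_{X_i}(\omega)$ and $\Phi_{W_i}(\omega)$ do not depend on i, we write $\Phi_{X_i}(\omega) \triangleq \Phi_X(\omega)$ and $\Phi_{W_i}(\omega) \triangleq
\Phi_W(\omega)$. From calculus we know that any function Φ(ω) whose derivative exists in a neighborhood
about $\omega_0$ can be represented by a Taylor series

$$\Phi(\omega) = \sum_{l=0}^{\infty} \frac{1}{l!}\, \Phi^{(l)}(\omega_0)(\omega - \omega_0)^l,$$

where $\Phi^{(l)}(\omega_0)$ is the lth derivative of Φ(ω) at $\omega_0$. Moreover, if the derivatives are continuous
in the interval $[\omega_0, \omega]$, Φ(ω) can be expressed as a finite Taylor series plus a remainder $A_L(\omega)$,
that is,

$$\Phi(\omega) = \sum_{l=0}^{L-1} \frac{1}{l!}\, \Phi^{(l)}(\omega_0)(\omega - \omega_0)^l + A_L(\omega),$$

where
$$A_L(\omega) \triangleq \frac{1}{L!}\, \Phi^{(L)}(\xi)(\omega - \omega_0)^L$$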

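and ξ is some point in the interval $[\omega_0, \omega]$. Let us apply this result to $\Phi_W(\omega)$ with $\omega_0 = 0$.
Then

$$\Phi_W(\omega) = \int_{-\infty}^{\infty} e^{j\omega x/\sqrt{n}} f_X(x)\, dx$$

$$\Phi_W^{(0)}(0) = 1$$

$$\Phi_W^{(1)}(0) = \left. \int_{-\infty}^{\infty} \frac{jx}{\sqrt{n}}\, e^{j\omega x/\sqrt{n}} f_X(x)\, dx \right|_{\omega = 0} = 0$$

$$\Phi_W^{(2)}(0) = \left. \int_{-\infty}^{\infty} \left(\frac{jx}{\sqrt{n}}\right)^2 e^{j\omega x/\sqrt{n}} f_X(x)\, dx \right|_{\omega = 0} = -\frac{1}{n}.$$
Hence
$$\Phi_W(\omega) = 1 - \frac{1}{2n}\omega^2 + \frac{R_3'(\omega)\,\omega^3}{n\sqrt{n}},$$
where

$$R_3'(\omega) \triangleq -j \int_{-\infty}^{\infty} x^3 e^{j\xi x/\sqrt{n}} f_X(x)\, dx / 6.$$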


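Since $Z_n = \sum_{i=1}^{n} W_i$, we obtain
$$\Phi_{Z_n}(\omega) = [\Phi_W(\omega)]^n$$
or

$$\ln \Phi_{Z_n}(\omega) = n \ln \Phi_W(\omega).$$
Now recall that for any h such that |h| < 1,

$$\ln(1 + h) = h - \frac{h^2}{2} + \frac{h^3}{3} - \cdots.$$

For any fixed ω, we can choose an n large enough so that (let $R_3' \triangleq R_3'(\omega)$)

$$\left| -\frac{\omega^2}{2n} + \frac{R_3'\,\omega^3}{n\sqrt{n}} \right| < 1.$$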

















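Assuming this to have been done, we can write

$$\begin{aligned}
\ln \Phi_{Z_n}(\omega) &= n \ln\left[1 - \frac{\omega^2}{2n} + \frac{R_3'\,\omega^3}{n\sqrt{n}}\right] \\
&= -\frac{\omega^2}{2} + \text{terms involving factors of } n^{-1/2},\ n^{-1},\ n^{-3/2}, \ldots.
\end{aligned}$$
Hence
$$\lim_{n \to \infty} [\ln \Phi_{Z_n}(\omega)] = -\frac{\omega^2}{2}$$

or, equivalently,
$$\lim_{n \to \infty} \Phi_{Z_n}(\omega) = e^{-\omega^2/2},$$
which is the CF of the N(0,1) RV. Note that to argue that $\lim_{n \to \infty} f_{Z_n}(z)$ is the normal
pdf we should have to argue that

$$\begin{aligned}
\lim_{n \to \infty} \Phi_{Z_n}(\omega) &\triangleq \lim_{n \to \infty} \left( \int_{-\infty}^{\infty} f_{Z_n}(z)\, e^{j\omega z}\, dz \right) \\
&= \int_{-\infty}^{\infty} \left( \lim_{n \to \infty} f_{Z_n}(z) \right) e^{j\omega z}\, dz.
\end{aligned}$$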


However, the operations of limiting and integrating are not always interchangeable. Hence 
we cannot say that the pdf of Z, converges to N(0,1). Indeed we already know from 
Example 4.7-5 that the sum of n i.i.d. binomial RVs is binomial regardless of how large n 
is; moreover, the binomial PMF or pdf is a discontinuous function while the Gaussian is 
continuous and no matter how large n is, this fact cannot be altered. However, the integrals 
of the binomial pdf, for large n, behave like integrals of the Gaussian pdf. This is why the 
distribution function of $Z_n$ tends to a Gaussian distribution function but not necessarily to
a Gaussian pdf. 
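
The convergence of the CF asserted by Theorem 4.7-2 is easy to observe numerically. As an illustration only, the sketch below takes each $X_i$ to be uniform on $(-\sqrt{3}, \sqrt{3})$, a choice of ours that has zero mean and unit variance and whose CF is $\sin(\sqrt{3}\omega)/(\sqrt{3}\omega)$. It then evaluates $[\Phi_X(\omega/\sqrt{n})]^n$ at one fixed frequency for increasing n and compares it with $e^{-\omega^2/2}$.

```python
import math

def cf_uniform_unit_variance(w):
    """CF of X uniform on (-sqrt(3), sqrt(3)), which has zero mean and unit variance."""
    if w == 0.0:
        return 1.0
    c = math.sqrt(3.0) * w
    return math.sin(c) / c

def cf_normalized_sum(w, n):
    """CF of Z_n = (X_1 + ... + X_n)/sqrt(n) for i.i.d. X_i: [Phi_X(w/sqrt(n))]^n."""
    return cf_uniform_unit_variance(w / math.sqrt(n)) ** n

w = 1.5  # an arbitrary fixed frequency
target = math.exp(-w * w / 2.0)  # CF of the N(0,1) RV at w
errors = [abs(cf_normalized_sum(w, n) - target) for n in (1, 4, 16, 64, 256)]
print(errors)  # the gap to the Normal CF shrinks as n grows
```

Note that this checks only the CF, in keeping with the caution above: it says nothing about pointwise convergence of a pdf or PMF.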
The astute reader will have noticed that in the prior development we showed the normal
convergence of the CF but not as yet the normal convergence of the CDF. To prove the
latter true we can use a continuity theorem† which states the following: Consider a sequence
of RVs $Z_i$, i = 1,...,n, with CFs and CDFs $\Phi_i(\omega)$ and $F_i(z)$, i = 1,...,n, respectively, with
$\Phi(\omega) \triangleq \lim_{n \to \infty} \Phi_n(\omega)$ and Φ(ω) continuous at ω = 0; then $F(z) = \lim_{n \to \infty} F_n(z)$.

Example 4.7-11
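(application of the Central Limit Theorem [CLT]) Let $X_i$, i = 1,...,n, be a sequence of
i.i.d. RVs with $E[X_i] = \mu_X$ and $\mathrm{Var}[X_i] = \sigma_X^2$. Let $Y \triangleq \sum_{i=1}^{n} X_i$ where n is large. We wish
to compute $P[a \leq Y \leq b]$ using the CLT. With $Z \triangleq (Y - E[Y])/\sigma_Y$, and $\sigma_Y > 0$,

$$P[a \leq Y \leq b] = P[a' \leq Z \leq b'],$$

where
$$a' \triangleq \frac{a - E[Y]}{\sigma_Y}, \qquad b' \triangleq \frac{b - E[Y]}{\sigma_Y},$$
and
$$\sigma_Y = \sqrt{n}\,\sigma_X.$$

Note that Z is a zero-mean, unity variance RV involving the sum of a large number (n
assumed large) of i.i.d. RVs. Indeed with some minor manipulations we can write Z as

$$Z = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left(\frac{X_i - \mu_X}{\sigma_X}\right).$$

Hence

$$P[a' \leq Z \leq b'] \approx \frac{1}{\sqrt{2\pi}} \int_{a'}^{b'} e^{-\frac{1}{2}z^2}\, dz.$$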




Although the CLT might be more appropriately called the “Normal convergence theorem,” 
the word central in Central Limit Theorem is useful as a reminder that CDFs converge to 
the normal CDF around the center, that is, around the mean. Although all CDFs converge 
together at ±∞, it is in fact in the tails that the CLT frequently gives the poorest estimates
of the correct probabilities, if these are small. An illustration of this phenomenon is given 
in Problem 4.59. 

†See Feller [4-1, p. 508].

In a type of computer-based engineering analysis called Monte-Carlo simulation, it
is often necessary to have access to random numbers. There are several random number
generators available in software that generate numbers that appear random but in fact are
not: They are generated using an algorithm that is completely deterministic and therefore
they can be duplicated by anyone who has a copy of the algorithm. The numbers, called
pseudo-random numbers, are often adequate for situations where not too many random
numbers are needed. For situations where a very large number of random numbers are
needed, for example, modeling atomic processes, it turns out that it is difficult to find an
adequate random number generator. Most will eventually display number sequences that
repeat, that is, are periodic, are highly correlated, or show other biases. Note that the
alternative, that is, using a naturally random process such as the emission of photons from
x-ray sources or the liberation of photoelectrons from valence bands in photodetectors, also
suffers from a major problem: We cannot be certain what underlying probability law is truly
at work. And even if we knew what law was at work, the very act of counting photons or
photoelectrons might bias the distribution of random numbers.

In any case, if we assume that for our purposes the uniform random number generators 
(URNG) commonly available with most PC software packages are adequate in that they 
create unbiased realizations of a uniform RV X, the next question is how can we convert 
uniform random numbers, that is, those that are assumed to obey the uniform pdf in (0, 1), 
to Gaussian random numbers. For this purpose we can use the CLT as follows. Let $X_i$
represent the ith random number generated by the URNG. Then

$$Z = X_1 + \ldots + X_n$$

will be approximately Gaussian for a reasonably large n (say >10). Note that the pdf of
Z is the n-repeated convolution of a unit pulse which starts to look like a Gaussian very
quickly everywhere except in the tails. The reason there is a problem in the tails is that Z
is confined to the range 0 < Z < n while if Z were a true Gaussian RV, then −∞ < Z < ∞.
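
A minimal sketch of this construction follows, using Python's built-in generator as the URNG. Two steps are our own additions rather than part of the text: the sum is centered by n/2 and scaled by $\sqrt{n/12}$ (each uniform number has variance 1/12) so that the output is approximately N(0, 1), and n = 12 is chosen so that the scale factor equals 1. The seed and sample count are arbitrary.

```python
import random
import statistics

def approx_gaussian(rng, n=12):
    """Approximate N(0, 1) sample: centered, scaled sum of n uniform(0, 1) numbers."""
    total = sum(rng.random() for _ in range(n))  # mean n/2, variance n/12
    return (total - n / 2.0) / (n / 12.0) ** 0.5

rng = random.Random(2024)  # fixed seed so the run is repeatable
samples = [approx_gaussian(rng) for _ in range(20000)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.pvariance(samples)
largest = max(abs(z) for z in samples)  # can never exceed 6 when n = 12
print(sample_mean, sample_var, largest)
```

The hard bound on the largest sample is the tail problem described above in concrete form: a true Gaussian RV has no such limit.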


4.8 ADDITIONAL EXAMPLES 


Example 4.8-1 


Let $X_i$, i = 1,...,n, be n i.i.d. Bernoulli RVs with individual PMF:

$$P_{X_i}(x) = \begin{cases} p^x (1 - p)^{1-x}, & x = 0, 1 \\ 0, & \text{else.} \end{cases}$$

Show that $Z = \sum_{i=1}^{n} X_i$ is binomial with PMF $b(k; n, p) \triangleq \binom{n}{k} p^k q^{n-k}$.

Solution The CF of the Bernoulli RV is computed as $\Phi_{X_i}(\omega) = \sum_{x=0}^{1} e^{j\omega x} p^x (1 - p)^{1-x} =
pe^{j\omega} + q$, where q = 1 − p. From Equation 4.7-4, we obtain that

$$\Phi_Z(\omega) = \prod_{i=1}^{n} (pe^{j\omega} + q) = (pe^{j\omega} + q)^n,$$

which, from Example 4.7-5, we recognize as the CF of the binomial RV with PMF as above.
Example 4.8-2
Let Z be a binomial RV with PMF $b(k; n, p) \triangleq \binom{n}{k} p^k q^{n-k}$, where n ≫ 1, and consider
the event {a ≤ Z ≤ b}, where a and b are numbers. Use the CLT to compute P[a ≤ Z ≤ b].


Solution From Example 4.8-1 we know Z can be resolved as a sum of n i.i.d. Bernoulli
RVs. Thus we write $Z = X_1 + \cdots + X_n$, where E[Z] = np and Var[Z] = npq ≫ pq when n
is large. The situation is now ripe for applying Theorem 4.7-1, the Central Limit Theorem.

The event {a ≤ Z ≤ b} is identical to the event $\left\{\frac{a - np}{\sqrt{npq}} \leq \frac{Z - np}{\sqrt{npq}} \leq \frac{b - np}{\sqrt{npq}}\right\}$. With $a' \triangleq
\frac{a - np}{\sqrt{npq}}$, $b' \triangleq \frac{b - np}{\sqrt{npq}}$, and $Z' \triangleq \frac{Z - np}{\sqrt{npq}}$, the event can be rewritten as {a' ≤ Z' ≤ b'}, where Z' is
a zero-mean, unit-variance RV. Then from Example 4.7-11, which uses a formula based on
the CLT, we get

$$P[a \leq Z \leq b] \approx \frac{1}{\sqrt{2\pi}} \int_{a'}^{b'} \exp\left[-\tfrac{1}{2}z^2\right] dz.$$

In terms of the standard Normal distribution, $F_{SN}(x)$, defined in Equation 1.11-3, this
result can be written as

$$P[a \leq Z \leq b] \approx F_{SN}\left[\frac{b - np}{\sqrt{npq}}\right] - F_{SN}\left[\frac{a - np}{\sqrt{npq}}\right].$$


The correction factor of 0.5 in the limits in Equation 1.11-5 is insignificant when n ≫ 1.


Example 4.8-3 

Let Z be a binomial RV with mean np and standard deviation $\sqrt{npq}$. Use the Normal
approximation furnished by the CLT to compute the probability of the following events:
$\{np - \sqrt{npq} \leq Z \leq np + \sqrt{npq}\}$, $\{np - 2\sqrt{npq} \leq Z \leq np + 2\sqrt{npq}\}$, $\{np - 3\sqrt{npq} \leq Z \leq
np + 3\sqrt{npq}\}$.

Solution With the change of variable $Z' \triangleq \frac{Z - np}{\sqrt{npq}}$ the three events are converted to
{−1 ≤ Z' ≤ 1}, {−2 ≤ Z' ≤ 2}, {−3 ≤ Z' ≤ 3}. The RV Z' is zero-mean, unit variance and
the Normal approximation furnished by the CLT yields:

$$P[-1 \leq Z' \leq 1] \approx F_{SN}(1) - F_{SN}(-1) \approx 0.683$$
$$P[-2 \leq Z' \leq 2] \approx F_{SN}(2) - F_{SN}(-2) \approx 0.954$$
$$P[-3 \leq Z' \leq 3] \approx F_{SN}(3) - F_{SN}(-3) \approx 0.997.$$


Note that the last-listed event is (almost) certain to occur. In a thousand repetitions it 
will on the average fail to occur only three times. 
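
These three values are straightforward to reproduce. The sketch below writes the standard Normal CDF in terms of `math.erf`, which is the conventional error function and differs from the erf used in this chapter (where the standard Normal CDF is written 1/2 + erf(z)); the function name is our own.

```python
import math

def standard_normal_cdf(x):
    """Standard Normal CDF, expressed through the conventional error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

probs = [standard_normal_cdf(c) - standard_normal_cdf(-c) for c in (1, 2, 3)]
print([round(p, 3) for p in probs])  # [0.683, 0.954, 0.997]
```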
Example 4.8-4
Let $X_i$, i = 1,...,100, be i.i.d. Poisson RVs with PMF $P[k] = e^{-2}\,\frac{2^k}{k!}$, k = 0, 1, 2, ....
Here k is the number of events in a given interval of time. Let $Z = \sum_{i=1}^{100} X_i$. We note that
E[Z] = 200 and Var[Z] = 200. This situation might reflect the summed data packets collected
at a receiver from identical multiple channels. Use the CLT to compute the probability of
the event {190 ≤ Z ≤ 210}.


Solution Since Z is the sum of a large number of i.i.d. RVs and the variance of any of these
is much smaller than the variance of the sum, the CLT permits us to use the Normal approximation
to compute the probability of this event. Define the RV $Z' \triangleq \frac{Z - 200}{\sqrt{200}}$, which is zero-mean
and unity variance. Then, in terms of Z', the event becomes {−0.707 ≤ Z' ≤ 0.707}.

The Normal approximation yields $F_{SN}(0.707) - F_{SN}(-0.707) \approx 0.52$.
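
Because a sum of independent Poisson RVs is again Poisson, Z here is Poisson with mean 200, so the Normal figure can be compared with an exact sum. The sketch below does both, assuming the event includes its endpoints 190 and 210; the function names are ours. Since the exact sum counts both endpoints in full while the approximation above applies no 0.5 correction, the exact value can be expected to come out somewhat above 0.52.

```python
import math

def standard_normal_cdf(x):
    """Standard Normal CDF via the conventional error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def poisson_pmf(k, a):
    """P[Z = k] for a Poisson RV with mean a, evaluated in logs to avoid overflow."""
    return math.exp(k * math.log(a) - a - math.lgamma(k + 1))

mean = var = 200.0
lo, hi = 190, 210  # endpoints assumed inclusive

z_lo = (lo - mean) / math.sqrt(var)
z_hi = (hi - mean) / math.sqrt(var)
clt_estimate = standard_normal_cdf(z_hi) - standard_normal_cdf(z_lo)
exact = sum(poisson_pmf(k, mean) for k in range(lo, hi + 1))
print(round(z_hi, 3), round(clt_estimate, 3), round(exact, 3))
```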











SUMMARY 


In this chapter we discussed the various averages of one or more random variables (RVs) and 
the implication of those averages. We began by defining the average or expected value of an 
RV X and then showed that the expected value of Y = g(X) could be computed directly
from the pdf or PMF of X. We briefly discussed the important notion of conditional expectation
and showed how the expected value of an RV could be advantageously computed by
averaging over its conditional expectation. We then argued that a single summary number
such as the average value, $\mu_X$, of X was insufficient for describing the behavior of X. This
led to the introduction of moments, that is, the average of powers of X. We illustrated how
moments can be used to estimate pdf's by the maximum entropy principle and introduced
the concept of joint moments. We showed how the covariance of two RVs could be interpreted
as a measure of how well we can predict one RV from observing another using a
linear predictor model. By giving a counterexample, we demonstrated that uncorrelated- 
ness does not imply independence of two RVs, the latter being a stronger condition. The 
joint Gaussian pdf for two RVs was discussed, and it was shown that in the Gaussian case, 
independence and uncorrelatedness are equivalent. We then introduced the reader to some 
important bounds and inequalities known as the Chebyshev and Schwarz inequalities and 
the Chernoff bound and illustrated how these are used in problems in probability. 

The second half of the chapter dealt mostly with moment generating functions (MGF's) 
and characteristic functions (CFs) and the Central Limit Theorem (CLT). We showed how 
the MGF and CF are essentially the Laplace and Fourier transforms, respectively, of the 
pdf of an RV and how we could compute all the moments, provided that these exist, from 
either of these functions. Several properties of these important functions were explored. We 
illustrated how the CF could be used to solve problems involving the computation of the 
pdf’s of the sums of RVs. 

We then discussed the CLT, one of the most important results in probability theory, and 
the basis for the ubiquitous Normal behavior of many random phenomena. The CLT states 
that under relatively loose mathematical constraints, the cumulative distribution function 
(CDF) of the sum of independent RVs tends toward the Normal CDF. 

We ended the chapter with additional examples of the use and application of the CLT. 
PROBLEMS
(*Starred problems are more advanced and may require more work and/or additional
reading.)
4.1 Compute the average and standard deviation of the following set: 3.50, 5.61, −2.37,
4.94, −6.25, −1.05, −3.75, 5.81, 2.27, 0.54, 6.11, −2.56.
4.2 Compute E[X] when X is a Bernoulli RV, that is,
$$X = \begin{cases} 1, & P_X(1) = p > 0, \\ 0, & P_X(0) = 1 - p > 0. \end{cases}$$
4.3 Let X = a (a constant). Prove that E[X] = a.
4.4 Consider a discrete random variable X whose pmf is given by
$$f_X(x) = \begin{cases} 1/3, & x = -1, 0, 1 \\ 0, & \text{otherwise} \end{cases}$$
Compute E[X].
4.5 Let X be a uniform RV, that is,
$$f_X(x) = \begin{cases} (b - a)^{-1}, & 0 < a < x < b, \\ 0, & \text{otherwise.} \end{cases}$$
Compute E[X].
4.6 Let the pdf of X be
$$f_X(x) = \begin{cases} 2x, & 0 < x < 1, \\ 0, & \text{else.} \end{cases}$$
(i) Compute $F_X(x)$;
(ii) Compute E[X];
(iii) Compute $\sigma_X^2$.
m n 
, . zj\k-zr . , 
4.7 Find E[X] if Px(x) = mE ,£ = 0,1, ..., k and 0, else. This PMF is called 
n 
k 
the hypergeometric distribution and m,n, k are positive integers. 
4.8 In Problem 4.5, let Y 2 ye, Compute the pdf of Y and E[Y] by Equation 4.1-8. 
. Then compute E[Y] by Equation 4.1-9. 
4.9 Let $Y \triangleq X^2 + 1$. Compute E[Y] and $\sigma_Y^2$ if
$$f_X(x) = \begin{cases} 2x, & 0 < x < 1, \\ 0, & \text{else.} \end{cases}$$
4.10 Let X be a Poisson RV with parameter a. Compute E[Y] when $Y \triangleq X^2 + b$.
4.11 Show that the mean of the Gaussian RV $X : N(\mu, \sigma^2)$ is μ. Start from the defining
integral for the mean.
4.12 In your physics courses, you have studied the concept of momentum p = mv in the
deterministic, that is, nonrandom sense. In reality, measurements of mass m and
velocity v are never precise, thereby giving rise to an unavoidable uncertainty in
these quantities. In this problem, we treat these quantities as RVs. So, consider
an RV mass M with given pdf $f_M(m)$ and an RV velocity V with given pdf $f_V(v)$.
We are also given the averages $\mu_M = E[M]$ and $\mu_V = E[V]$ (that would presumably
correspond to our measurements in the physics course). Assume that M and V are
independent and nonnegative RVs.

(a) Express the pdf of the momentum P = MV in terms of the known pdf's
$f_M(m)$ and $f_V(v)$.
(b) Determine the expected value of the momentum $\mu_P = E[P]$ as a function of
$\mu_M$ and $\mu_V$.

4.13 Prove that if E[X] exists and X is a continuous RV, then |E[X]| ≤ E[|X|]. Repeat
for X discrete.

4.14 Show that if $E[g_i(X)]$ exists for i = 1,...,N, then

$$E\left[\sum_{i=1}^{N} g_i(X)\right] = \sum_{i=1}^{N} E[g_i(X)].$$

4.15 A random sample of 20 households shows the following numbers of children per
household: 3, 2, 0, 1, 0, 0, 3, 2, 5, 0, 1, 1, 2, 0, 1, 0, 0, 0, 6, 3. (a) For this set what
is the average number of children per household? (b) What is the average number of
children in households given that there is at least one child?

4.16 Let $B \triangleq \{a < X \leq b\}$. Derive a general expression for E[X|B] if X is a continuous
RV. Let X : N(0,1) with B = {−1 < X ≤ 2}. Compute E[X|B].

4.17 (Papoulis [4-3]). Let Y = h(X). We wish to compute approximations to E[h(X)]
and $E[h^2(X)]$. Assume that h(x) admits a power series expansion, that is, all
derivatives exist. Assume further that all derivatives above the second are small
enough to be omitted. Given that E[X] = μ and Var(X) = $\sigma^2$, show that

(a) $E[h(X)] \approx h(\mu) + h''(\mu)\sigma^2/2$;
(b) $E[h^2(X)] \approx h^2(\mu) + \left([h'(\mu)]^2 + h(\mu)h''(\mu)\right)\sigma^2$.

4.18 The joint pdf of a bivariate random variable (X,Y) is given by

$$f_{XY}(x, y) = \begin{cases} 2, & 0 < y < x < 1 \\ 0, & \text{otherwise} \end{cases}$$

(a) Find the conditional pdf of Y given X = x, denoted by $f_{Y/X}(y/x)$.
(b) Find the conditional mean of Y given X = x, i.e., E[Y/x].
(c) Compute the mean E[Y].
4.19 A particular model of an HDTV is manufactured in three different plants, say, A, B,
and C, of the same company. Because the workers at A, B, and C are not equally
experienced, the quality of the units differs from plant to plant. The pdf's of the
time-to-failure X, in years, are

$$f_X(x) = \frac{1}{5} \exp(-x/5)\,u(x) \quad \text{for A}$$

$$f_X(x) = \frac{1}{6.5} \exp(-x/6.5)\,u(x) \quad \text{for B}$$

$$f_X(x) = \frac{1}{10} \exp(-x/10)\,u(x) \quad \text{for C,}$$

where u(x) is the unit step. Plant A produces three times as many units as B,
which produces twice as many as C. The TVs are all sent to a central warehouse,
intermingled, and shipped to retail stores all around the country. What is the expected
lifetime of a unit purchased at random?

4.20 A source transmits a signal Θ with pdf

$$f_\Theta(\theta) = \begin{cases} (2\pi)^{-1}, & 0 < \theta \leq 2\pi, \\ 0, & \text{otherwise.} \end{cases}$$

Because of additive Gaussian noise, the pdf of the received signal Y when Θ = θ is

$$f_{Y|\Theta}(y|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{y - \theta}{\sigma}\right)^2\right].$$

Compute E[Y].

4.21 Compute the variance of X if X is (a) Bernoulli; (b) binomial; (c) Poisson; (d) Gaussian;
(e) Rayleigh.

4.22 An Internet Service Provider (ISP) has two types of servers that route incoming
packets for its customers. The servers fail randomly and have been found to have
time-to-failure distributions that are exponential with parameters $\mu_1$ and $\mu_2$, respectively.
Call these two RV failure times $T_1$ and $T_2$, and assume they are independent.
Thirty percent of the servers are type 1 and 70 percent are type 2. If a server is
picked at random, denote its time-to-failure by the RV T.

(a) What is E[T]?
(b) What is $E[T^2]$?
(c) What is the standard deviation $\sigma_T$?

4.23 Let X and Y be independent RVs, each N(0,1). Find the mean and variance of
$Z \triangleq \sqrt{X^2 + Y^2}$.

4.24 Let $X_1, X_2, X_3$ be three i.i.d. standard Normal RVs. We order them as $Y_1 < Y_2 < Y_3$.

(a) Compute $f_{Y_1 Y_2 Y_3}(y_1, y_2, y_3)$;
(b) Compute $E[Y_i]$, i = 1, 2, 3.
4.25 Let $f_{XY}(x, y) = 2$ for 0 < x < y < 1 and zero else. Compute E[Y] and $\sigma_Y^2$.

4.26 Let $f_{XY}(x, y)$ be given by

$$f_{XY}(x, y) = \frac{1}{2\pi\sigma^2\sqrt{1 - \rho^2}} \exp\left(-\frac{x^2 + y^2 - 2\rho xy}{2\sigma^2(1 - \rho^2)}\right),$$

where |ρ| < 1. Show that E[Y] = 0 but E[Y|X = x] = ρx. What does this result say
about predicting the value of Y upon observing the value of X?

4.27 Let X and Y be two Gaussian RVs with mean 0 and variance $\sigma^2$. Let

$$Z \triangleq \frac{1}{2}(X + Y).$$

(a) If X and Y are independent, what are the mean and variance of Z?
(b) Suppose X and Y are no longer independent. Let ρ be the correlation coefficient
of X and Y. Now, what would be the mean and variance of Z? (Your
answer may be in terms of ρ.)
(c) Consider what happens when ρ = −1, ρ = 0, and ρ = +1. Is it always true
that

4.28 Show that in the joint Gaussian pdf with $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y \triangleq \sigma$, the joint
pdf asymptotically as ρ → 1 becomes

$$f_{XY}(x, y) \to \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2\right] \delta(y - x).$$

4.29 Consider a probability space $\mathcal{P} = (\Omega, \mathcal{F}, P)$. Let $\Omega = \{\zeta_1, \ldots, \zeta_5\} = \{-1, -\frac{1}{2}, 0, \frac{1}{2}, 1\}$
with $P[\{\zeta_i\}] = \frac{1}{5}$, i = 1,...,5. Define two RVs on $\mathcal{P}$ as follows:

$$X(\zeta) \triangleq \zeta \quad \text{and} \quad Y(\zeta) \triangleq \zeta^2.$$

(a) Show that X and Y are dependent RVs.
(b) Show that X and Y are uncorrelated.

4.30 Given the conditional Gaussian density

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \alpha x)^2}{2\sigma^2}\right)$$

for two RVs X and Y, what is the conditional mean E[Y|X]? Here α is a known
constant.

4.31 We wish to estimate the pdf of X with a function p(x) that maximizes the entropy

$$H[X] \triangleq -\int_{-\infty}^{\infty} p(x) \ln p(x)\, dx.$$

It is known from measurements that E[X] = μ and Var[X] = $\sigma^2$. Find the maximum
entropy estimate of the pdf of X.
4.32 Let $X : N(0, \sigma^2)$. Show that

$$m_n \triangleq E[X^n] = 1 \cdot 3 \cdots (n - 1)\,\sigma^n \quad n \text{ even} \tag{4.8-1}$$
$$m_n = 0 \quad n \text{ odd.} \tag{4.8-2}$$

4.33 With $\mu_X \triangleq E[X]$ and $\mu_Y \triangleq E[Y]$, show that if $c_{11} = \sqrt{c_{20}c_{02}}$, then

$$E\left[\left(\frac{c_{11}}{c_{20}}(X - \mu_X) - (Y - \mu_Y)\right)^2\right] = 0.$$

Use this result to show that when |ρ| = 1, Y is a linear function of X, that is,
Y = αX + β. Relate α, β to the moments of X and Y.

4.34 Show that in the optimum linear predictor in Example 4.3-4 the smallest mean-square
error is

$$\varepsilon_{\min}^2 = \sigma_Y^2 (1 - \rho^2).$$
Explain why $\varepsilon_{\min}^2 = 0$ when |ρ| = 1.

4.35 We are given an RV X with pdf $f_X(x) = 1 - (1/2)x$, for 0 ≤ x ≤ 2 and zero else.
Compute $m_r$, the rth moment of X for r a positive integer.

4.36 Let $E[X_i] = \mu$, $\mathrm{Var}[X_i] = \sigma^2$. We wish to estimate μ with the sample mean

$$\hat{\mu}_N \triangleq \frac{1}{N} \sum_{i=1}^{N} X_i.$$

Compute the mean and variance of $\hat{\mu}_N$ assuming the $X_i$ for i = 1,...,N are independent.

4.37 In the previous problem, how large should N be so that

$$P[|\hat{\mu}_N - \mu| > 0.1\sigma] \leq 0.01.$$

4.38 Let X be a uniform RV in $(-\frac{1}{2}, \frac{1}{2})$. Compute (a) its moment-generating function;
and (b) its mean by Equation 4.5-5. [Hint: $\sinh z \triangleq (e^z - e^{-z})/2$. Use limits when
computing the mean.]

4.39 Let X be a Poisson RV. Compute its (a) MGF; and (b) its mean by Equation 4.5-5.

4.40 The negative binomial distribution with parameters N, Q, P, where Q − P = 1,
P > 0, and N ≥ 1, is defined by PMF

$$P_X(k) = \binom{N + k - 1}{N - 1} \left(\frac{P}{Q}\right)^k \left(1 - \frac{P}{Q}\right)^N \quad (k = 0, 1, 2, \ldots).$$

It is sometimes used as an alternative to the Poisson distribution when one cannot
guarantee that individual events occur independently (the "strict" randomness
requirement for the Poisson distribution). Show that the moment-generating function
is
$$M_X(t) = (Q - Pe^t)^{-N}.$$

[Hint: Either compute or look up the expansion formula for $(Q - Pe^t)^{-N}$, for example,
see Discrete Distributions by N. L. Johnson and S. Kotz, John Wiley and Sons, 1969.]

4.41 Let X have pdf $f_X(x; \alpha, \beta) = (\alpha!\,\beta^{\alpha+1})^{-1} x^\alpha \exp(-x/\beta)$, 0 < x < ∞, β > 0, α ≥ 0.
Find the moment-generating function of X. This is the gamma distribution.

4.42 Find the mean and variance of X if X has a gamma distribution.

4.43 Compute the Chernoff bound on P[X ≥ a], where X is an RV that satisfies the
exponential law $f_X(x) = \lambda e^{-\lambda x} u(x)$.

4.44 Let N = 1 in Problem 4.40. (a) Compute the Chernoff bound on P[X ≥ k]; (b) generalize
the result for arbitrary N.

4.45 Let X have a Cauchy pdf

$$f_X(x) = \frac{\alpha}{\pi(\alpha^2 + x^2)}.$$

Compute the CF $\Phi_X(\omega)$ of X.

4.46 Let X have the Cauchy density: $f_X(x) = \left(\pi(1 + (x - a)^2)\right)^{-1}$, −∞ < x < ∞. Find
E[X]. What problem do you run into when trying to compute $\sigma_X^2$?

4.47 Find the CF of the exponential RV X with mean μ > 0, that is,

$$f_X(x) = \frac{1}{\mu}\, e^{-x/\mu}\, u(x),$$

where u(x) denotes the unit-step function.

4.48 Find the characteristic function of a Cauchy random variable with pdf

$$f_X(x) = \frac{\alpha}{\pi(x^2 + \alpha^2)}, \quad -\infty < x < \infty.$$

If $X_1, X_2, \ldots, X_n$ are n independent Cauchy random variables with the above pdf
and $Y_n = \frac{1}{n} \sum_{i=1}^{n} X_i$,
(a) Find the pdf of $Y_n$.
(b) Does the Central Limit Theorem hold for $Y_n$?

4.49 Let X be uniform over (−a, a). Let Y be independent of X and uniform over
([n − 2]a, na), n = 1, 2, .... Compute the expected value of Z = X + Y for each
n. From this result sketch the pdf of Z. What is the only effect of n?

4.50 Consider the recursion known as a first-order moving average given by

$$X_n = Z_n - aZ_{n-1}, \quad |a| < 1,$$
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4.51 


4.52 
4.53 


4.54 


4.55 


4.56 


4.57 


4.58 


4.59 


where Xn, Zn, Zn—1 are all RVs for n =..., —1,0,1,.... Assume E|Zn] = 0 all n; 
E|ZaZ;] = 0 all n # j; and E[Z2] = o? all n. Compute Rp(k) 2 E[XnXn—x] for 
k= 0, +1, +2,.... 

Consider the recursion known as a first-order autoregression 


Xn =bXn-1+Zn l< 


The following is assumed true: E[Z,] = 0, E[Z2] = o? all n; E[Z,Z;] = 0 all n # j. 
Also E[ZnXn-;] = 0 for j = 1,2,.... Compute R,(k) = E[XnXn—,] for k = 1, 
+2,.... Assume E|X?] 2k independent of n. 

4.52 Give an example of two random variables which are uncorrelated but not independent.

4.53 Let $f_{XY}(x, y) = 4\exp(-4[x + y])$, x > 0, y > 0. Find the joint MGF and CF
of (X, Y).

4.54 Let X and Y be two independent Poisson RVs with


Compute the PMF of Z = X + Y using MGFs or CFs.

4.55 Your company manufactures toaster ovens. Let the probability that a toaster oven has
a dent or scratch be p = 0.05. Assume different ovens get dented or scratched independently.
In one week the company makes 2000 of these ovens. What is the approximate
probability that in this week more than 110 ovens are dented or scratched?

4.56 Message length L (in bytes) on a network can be modeled as an i.i.d. exponential RV
with CDF

$$F_L(l) \triangleq P[L \le l] = \begin{cases} 1 - e^{-0.002\,l}, & l \ge 0, \\ 0, & l < 0. \end{cases}$$

(a) What is the expected length (in bytes) of the file necessary to store 400
messages?

(b) What is the probability that the average length of 400 randomly chosen
messages exceeds 520 bytes?


4.57 Use Chebyshev’s inequality to find how many times a fair coin must be tossed in
order that the probability that the ratio of the number of heads to the number of
tosses will lie between 0.45 and 0.55 will be at least 0.95.

4.58 A distribution with unknown mean μ has a variance equal to 1.5. Use the Central Limit
Theorem to find how large a sample should be taken from the distribution in order
that the probability be at least 0.95 that the sample mean will be within 0.5 of the
population mean.

4.59 Let $X_i$ for i = 1, ..., n be a sequence of i.i.d. Bernoulli RVs with $P_X(1) = p$ and
$P_X(0) = q = 1 - p$. Let the event of a {1} be a success and the event of a {0} be a
failure.

(a) Show that

$$Z_n \triangleq \frac{1}{\sqrt{n}} \sum_{i=1}^{n} W_i,$$

where $W_i \triangleq (X_i - p)/\sqrt{pq}$, is a zero-mean, unity variance RV with a Normal
CDF when n >> 1.

(b) For n = 2000 and k = 110, 130, 150 compute P[k successes in n tries]
using (i) the exact binomial expression; (ii) the Poisson approximation to the
binomial; and (iii) the CLT approximations. Do this by writing three MATLAB
miniprograms. Verify that as the correct probabilities decrease, the error in
the CLT approximation increases.


4.60 In Chapter 1, the following problem was solved using an approximation to the binomial
probability law.

Assume that code errors in a computer program occur as follows: A line of code contains
errors with probability p = 0.001 and is error free with probability q = 0.999. Also errors
in different lines occur independently. In a 1000-line program, what is the approximate
probability of finding 2 or more erroneous lines?

Can the Central Limit Theorem be used here to give an approximate answer? Why
or why not? Explain your answer.

4.61 Assume that we have a uniform random number generator (URNG) that is well
modeled by a sequence of i.i.d. uniform RVs $X_i$, i = 1, ..., n, where $X_i$ is the ith
output of the URNG. Assume that

$$f_{X_i}(x_i) = \frac{1}{a}\,\mathrm{rect}\!\left(\frac{x_i - a/2}{a}\right).$$

(a) Show that with $Z_n = X_1 + \cdots + X_n$, $E[Z_n] = na/2$. (b) Show that $\mathrm{Var}(Z_n) = na^2/12$.
(c) Write a MATLAB program that computes the plots $f_{Z_n}(z)$ for n = 2, 3, 10, 20.
(d) Write a MATLAB program that plots Gaussian pdf’s $N(na/2,\, na^2/12)$
for n = 2, 3, 10, 20 and compare $f_{Z_n}(z)$ with $N(na/2,\, na^2/12)$ for each n. (e) For each
n compute $P[\mu_n - k\sigma_n \le Z_n \le \mu_n + k\sigma_n]$, where $\mu_n = na/2$, $\sigma_n^2 = na^2/12$, for
a few values of k, for example, k = 0.1, 0.5, 1, 2, 3. Do this using both $f_{Z_n}(z)$ and
$N(na/2,\, na^2/12)$. Choose any reasonable value of a, for example, a = 1.

4.62 Let $f_X(x)$ be the pdf of a real, continuous RV X. Show that if $f_X(x) = f_X(-x)$, then
E[X] = 0.

4.63 Let random variables X and Y be defined by X = cos Θ and Y = sin Θ, where Θ is
a random variable uniformly distributed over (0, 2π).
Compute E(X), E(Y), E(XY), E(X²), E(Y²), E(X²Y²).

4.64 Let X be a Normal RV with X: N(μ, σ²). Show that $E[(X - \mu)^{2k+1}] = 0$, while
$E[(X - \mu)^{2k}] = [(2k)!/2^k k!]\,\sigma^{2k}$.

4.65 (a) Write a MATLAB program (.m file) that will compute the pdf for a Chi-square
RV $Z_n$ and display it as a graph for n = 30, 40, 50. (b) Add to your program the
capability to compute $P[\mu - \sigma \le Z_n \le \mu + \sigma]$. Compare your result with a Gaussian
approximation $P[\mu - \sigma \le X \le \mu + \sigma]$, where X: N(n, 2n).
4.66 Let $X_i$, i = 1, ..., 4, be four zero-mean Gaussian RVs. Use the joint CF to show that

$$E[X_1X_2X_3X_4] = E[X_1X_2]E[X_3X_4] + E[X_1X_3]E[X_2X_4] + E[X_2X_3]E[X_1X_4].$$

4.67 Compute the MGF and CF for the Chi-square RV with n degrees of freedom.

4.68 Let $E[X_i] = \mu$, $\mathrm{Var}[X_i] = \sigma^2$. We wish to estimate μ with the sample mean

$$\hat{\mu} \triangleq \frac{1}{N}\sum_{i=1}^{N} X_i.$$

Compute the mean and second moment of $\hat{\mu}$ assuming the $X_i$ for i = 1, ..., N are
independent.

4.69 Is the converse statement of Problem 4.62 true? That is, if E[X] = 0, does that imply
that $f_X(x) = f_X(-x)$?

4.70 Derive the moment generating function of a random variable X with pdf $f_X(x) = \lambda e^{-\lambda x}$,
x > 0, λ > 0 and zero otherwise. Hence obtain the mean and variance of X.

4.71 Assuming that the $X_i$ are i.i.d. and Normal, show that
$W_n \triangleq \sum_{i=1}^{n}\left[\left(X_i - \frac{1}{n}\sum_{j=1}^{n} X_j\right)/\sigma\right]^2$
is Chi-square with n − 1 degrees of freedom.

4.72 (conditional expectation) Let Y = X + N, where the RVs X and N are independent
Poisson RVs with means 20 and 5, respectively.

(a) Find the conditional PMF of Y given X.
(b) Find the conditional mean E[Y|X = x].

4.73 Derive the inequality $\sigma_X P[|X| \ge \sigma_X] \le E[|X|] \le \sigma_X$ that holds true if $f_X(x) = f_X(-x)$.

4.74 Consider two RVs X and Y together with given values for $\mu_X$, $\mu_Y$, $\sigma_X^2$, $\sigma_Y^2$, and ρ.
We make a linear estimate of Y based on X, that is,

$$\hat{Y} = \alpha X + \beta.$$

Define the estimate error as

$$\varepsilon \triangleq Y - \hat{Y}.$$

(a) Then find the covariance of the estimate error and the data X, that is, find

$$\mathrm{Cov}[\varepsilon, X] = E[\varepsilon X] - E[\varepsilon]E[X].$$


Express your answer in terms of α and β and the above given parameter
values.
(b) Set α and β to their optimal values. Then evaluate Cov[ε, X] again.

4.75 In Problem 4.74 we looked at estimating the RV Y from the RV X with the linear
estimate

$$\hat{Y} = \alpha X + \beta.$$

It turned out that the optimal values α and β found in class resulted in Cov[ε, X] = 0.
Now it is relatively easy to show that this condition, that is,

$$\mathrm{Cov}[\varepsilon, X] = 0,$$

known as the orthogonality condition, holds for general linear estimation problems
where, as above, we want to find the best linear estimate in the sense of minimizing
the mean-square error. In words we say that the estimate error ε is orthogonal to
the data used in the estimate, in this case X.

Here we consider a slight generalization of this problem. We now form a linear
estimate of Y based on two RVs $X_1$ and $X_2$, that is,

$$\hat{Y} = \alpha_1 X_1 + \alpha_2 X_2 + \beta.$$

We will determine the values of $\alpha_1$ and $\alpha_2$ from the two orthogonality conditions

$$\mathrm{Cov}[\varepsilon, X_1] = 0 \quad\text{and}\quad \mathrm{Cov}[\varepsilon, X_2] = 0.$$

To make matters simpler, we assume that all three mean values are zero, which implies
β = 0, so that the linear estimate simplifies to

$$\hat{Y} = \alpha_1 X_1 + \alpha_2 X_2.$$

As before, the error is written as $\varepsilon = Y - \hat{Y}$. Note that due to the means being
zero, $\mathrm{Cov}[\varepsilon, X_1] = E[\varepsilon X_1]$ and $\mathrm{Cov}[\varepsilon, X_2] = E[\varepsilon X_2]$. Please use the following values:


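$\sigma_1^2 = 1$, $\sigma_2^2 = 4$, $\sigma_Y^2 = 4$,
$\rho_1 = 0.5$, $\rho_2 = 0.7$, $\rho_{12} = 0.5$,

where $\rho_1 = E[X_1Y]/\sigma_1\sigma_Y$, $\rho_2 = E[X_2Y]/\sigma_2\sigma_Y$, and $\rho_{12} = E[X_1X_2]/\sigma_1\sigma_2$ here since
the mean values are all zero.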


(a) Using these given values, write two linear equations that can be solved for
$\alpha_1$ and $\alpha_2$ using the orthogonality conditions in the form

$$E[\varepsilon X_1] = 0 \quad\text{and}\quad E[\varepsilon X_2] = 0.$$

(b) Solve these two linear equations for $\alpha_1$ and $\alpha_2$.


5 Random Vectors


5.1 JOINT DISTRIBUTION AND DENSITIES 


In many practical problems involving random phenomena we make observations that are 
essentially of a vector nature. We illustrate with three examples. 


Example 5.1-1
(seismic discrimination) A seismic waveform X(t) is received at a geophysical recording
station and is sampled at the instants $t_1, t_2, \ldots, t_n$. We thus obtain a vector
$\mathbf{X} = (X_1, \ldots, X_n)^T$, where $X_i \triangleq X(t_i)$ and T denotes transpose.† For political and military reasons, at one
time it was important to determine whether the waveform was radiated from an earthquake
or an underground explosion. Assume that an expert computer system has available a lot
of stored data regarding both earthquakes and underground explosions. The vector X is
compared to the stored data. What is the probability that X(t) is correctly identified?





Example 5.1-2 
(health vector) To evaluate the health of grade-school children, the Health Department of 
a certain region measures the height, weight, blood pressure, red-blood cell count, white- 
blood cell count, pulmonary capacity, heart rate, blood-lead level, and vision acuity of each 
child. The resulting vector X is taken as a summary of the health of each child. What is 
the probability that a child chosen at random is healthy? 


† All vectors will be assumed to be column vectors unless otherwise stated.

Example 5.1-3
(disease detection) A computer system equipped with a digital scanner is designed to recognize
black-lung disease from x-rays. It does this by counting the number of radio-opacities
in six lung zones (that is, three in each lung) and estimating the average size of the opacities 
in each zone. The result is a 12-component vector X from which a decision is made. What 
is the best computer decision? 








The three previous examples are illustrative of many problems encountered in engineering 
and science that involve a number of random variables (RVs) that are grouped for some 
purpose. Such groups of RVs are conveniently studied by vector methods. For this reason 
we treat these grouped RVs as a single object called a random vector. As in earlier chapters, 
capital letters at the lower end of the alphabet will denote RVs; bold capital letters will 
denote random vectors and matrices and lowercase bold letters are deterministic vectors, 
for example, the values that random vectors assume. 

Consider a sample description space Ω with point ζ and a set of n real RVs $X_1, X_2, \ldots, X_n$
from Ω to the real line R. For each ζ ∈ Ω we generate the n-component vector of numbers
$\mathbf{X}(\zeta) \triangleq (X_1(\zeta), X_2(\zeta), \ldots, X_n(\zeta)) \in R^n$. Then $\mathbf{X} \triangleq (X_1, X_2, \ldots, X_n)$ is said to be an
n-dimensional real random vector. The definition is readily extended to a complex random
vector. Let X be an n-dimensional random vector defined on sample space Ω with CDF
$F_{\mathbf{X}}(\mathbf{x})$. Then by definition†

$$F_{\mathbf{X}}(\mathbf{x}) \triangleq P[X_1 \le x_1, \ldots, X_n \le x_n]. \tag{5.1-1}$$

By defining $\{\mathbf{X} \le \mathbf{x}\} \triangleq \{X_1 \le x_1, \ldots, X_n \le x_n\}$, we can rewrite Equation 5.1-1 concisely as

$$F_{\mathbf{X}}(\mathbf{x}) \triangleq P[\mathbf{X} \le \mathbf{x}]. \tag{5.1-2}$$


We associate the events $\{\mathbf{X} \le \infty\}$ and $\{\mathbf{X} \le -\infty\}$ with the certain event Ω and impossible
event φ, respectively. Hence

$$F_{\mathbf{X}}(\infty) = 1 \tag{5.1-3a}$$
$$F_{\mathbf{X}}(-\infty) = 0. \tag{5.1-3b}$$


If the nth mixed partial of $F_{\mathbf{X}}(\mathbf{x})$ exists we can define a probability density function (pdf) as

$$f_{\mathbf{X}}(\mathbf{x}) \triangleq \frac{\partial^n F_{\mathbf{X}}(\mathbf{x})}{\partial x_1 \cdots \partial x_n}. \tag{5.1-4}$$

† We remind the reader that the event $\{X_1 \le x_1, \ldots, X_n \le x_n\}$ is the intersection of the n events
$\{X_i \le x_i\}$ for i = 1, ..., n. If any one of these sub-events is the impossible event, e.g., $\{X_i \le -\infty\}$, then the
whole event becomes the impossible event and we would still write $F_{\mathbf{X}}(-\infty) = 0$.

The reader will observe that these definitions are completely analogous to the scalar definitions
given in Chapter 2. We could have defined


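$$f_{\mathbf{X}}(\mathbf{x}) \triangleq \lim_{\substack{\Delta x_1 \to 0 \\ \vdots \\ \Delta x_n \to 0}} \frac{P[x_1 < X_1 \le x_1 + \Delta x_1, \ldots, x_n < X_n \le x_n + \Delta x_n]}{\Delta x_1 \cdots \Delta x_n} \tag{5.1-5}$$

and arrived at Equation 5.1-4. For example, for n = 2

$$P[x_1 < X_1 \le x_1 + \Delta x_1,\ x_2 < X_2 \le x_2 + \Delta x_2] = F_{\mathbf{X}}(x_1 + \Delta x_1, x_2 + \Delta x_2) - F_{\mathbf{X}}(x_1, x_2 + \Delta x_2) - F_{\mathbf{X}}(x_1 + \Delta x_1, x_2) + F_{\mathbf{X}}(x_1, x_2).$$

Thus (still for n = 2)

$$f_{\mathbf{X}}(\mathbf{x}) = \lim_{\substack{\Delta x_1 \to 0 \\ \Delta x_2 \to 0}} \frac{1}{\Delta x_1 \Delta x_2}\left[F_{\mathbf{X}}(x_1 + \Delta x_1, x_2 + \Delta x_2) - F_{\mathbf{X}}(x_1 + \Delta x_1, x_2) - F_{\mathbf{X}}(x_1, x_2 + \Delta x_2) + F_{\mathbf{X}}(x_1, x_2)\right],$$

which is by definition the second mixed partial derivative, and thus

$$f_{\mathbf{X}}(x_1, x_2) = \frac{\partial^2 F_{\mathbf{X}}(x_1, x_2)}{\partial x_1\, \partial x_2}.$$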

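From Equation 5.1-5 we make the useful observation that

$$f_{\mathbf{X}}(\mathbf{x})\,\Delta x_1 \cdots \Delta x_n \approx P[x_1 < X_1 \le x_1 + \Delta x_1, \ldots, x_n < X_n \le x_n + \Delta x_n] \tag{5.1-6}$$

if the increments are small. If we integrate Equation 5.1-4, we obtain the CDF as

$$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_{\mathbf{X}}(\mathbf{x}')\, dx_1' \cdots dx_n',$$

which we can write in compact notation as

$$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{\mathbf{x}} f_{\mathbf{X}}(\mathbf{x}')\, d\mathbf{x}'.$$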

More generally, for any event $B \subset R^N$ ($R^N$ being Euclidean N-space) consisting of the
countable union and intersection of parallelepipeds

$$P[B] = \int_{\mathbf{x} \in B} f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x}. \tag{5.1-7}$$

(Compare with Equation 2.5-3.) The argument behind the validity of Equation 5.1-7 follows
very closely the argument furnished in the one-dimensional case (Section 2.5). Davenport
[5-1, p. 149] discusses the validity of Equation 5.1-7 for the case n = 2. For n > 2
one can proceed by induction.

The CDF of X given the event B is defined by

$$F_{\mathbf{X}|B}(\mathbf{x}|B) \triangleq P[\mathbf{X} \le \mathbf{x}\,|\,B] = \frac{P[\mathbf{X} \le \mathbf{x}, B]}{P[B]} \qquad (P[B] \ne 0).$$

These and subsequent results closely parallel the one-dimensional case. Consider next the
n disjoint and exhaustive events $\{B_i, i = 1, \ldots, n\}$ with $P[B_i] > 0$. Then $\bigcup_{i=1}^{n} B_i = \Omega$ and
$B_iB_j = \phi$ for all i ≠ j. From the Total Probability Theorem 1.6-1, it then follows that

$$F_{\mathbf{X}}(\mathbf{x}) = \sum_{i=1}^{n} F_{\mathbf{X}|B_i}(\mathbf{x}|B_i)\,P[B_i]. \tag{5.1-8}$$
The unconditional CDF on the left is sometimes called a mixture distribution function. The
conditional pdf of X given the event B is an nth mixed partial derivative of $F_{\mathbf{X}|B}(\mathbf{x}|B)$ if
it exists. Thus,

$$f_{\mathbf{X}|B}(\mathbf{x}|B) \triangleq \frac{\partial^n F_{\mathbf{X}|B}(\mathbf{x}|B)}{\partial x_1 \cdots \partial x_n}. \tag{5.1-9}$$

It follows from Equations 5.1-8 and 5.1-9 that

$$f_{\mathbf{X}}(\mathbf{x}) = \sum_{i=1}^{n} f_{\mathbf{X}|B_i}(\mathbf{x}|B_i)\,P[B_i]. \tag{5.1-10}$$

Because $f_{\mathbf{X}}(\mathbf{x})$ is a mixture, that is, a linear combination of conditional pdf’s, it is sometimes
called a mixture pdf.†
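As an illustration of a mixture pdf, the sketch below builds a weighted sum of conditional pdf's in the scalar case. The two-event partition, the priors 0.3 and 0.7, and the Gaussian conditional densities are all assumptions chosen for the example, not values from the text. A Riemann sum then checks that the mixture still integrates to approximately one.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical partition into two events with P[B_1] = 0.3 and P[B_2] = 0.7
priors = [0.3, 0.7]
conditional_pdfs = [
    lambda x: gaussian_pdf(x, -1.0, 0.5),  # assumed pdf of X given B_1
    lambda x: gaussian_pdf(x, 2.0, 1.0),   # assumed pdf of X given B_2
]

def mixture_pdf(x):
    # Weighted sum of conditional pdf's, with the priors as weights
    return sum(p * f(x) for p, f in zip(priors, conditional_pdfs))

# Riemann sum over [-10, 10], which holds essentially all of the mass
step = 0.001
total = sum(mixture_pdf(-10.0 + k * step) for k in range(20001)) * step
print(total)
```

Because the priors sum to one and each conditional pdf integrates to one, the mixture does as well.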
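The joint CDF of two random vectors $\mathbf{X} = (X_1, \ldots, X_n)^T$ and $\mathbf{Y} = (Y_1, \ldots, Y_m)^T$ is

$$F_{\mathbf{XY}}(\mathbf{x}, \mathbf{y}) = P[\mathbf{X} \le \mathbf{x}, \mathbf{Y} \le \mathbf{y}]. \tag{5.1-11}$$

The joint density of X and Y, if it exists, is given by

$$f_{\mathbf{XY}}(\mathbf{x}, \mathbf{y}) = \frac{\partial^{(n+m)} F_{\mathbf{XY}}(\mathbf{x}, \mathbf{y})}{\partial x_1 \cdots \partial x_n\, \partial y_1 \cdots \partial y_m}. \tag{5.1-12}$$

The marginal density of X alone, $f_{\mathbf{X}}(\mathbf{x})$, can be obtained from $f_{\mathbf{XY}}(\mathbf{x}, \mathbf{y})$ by integration,
that is,

$$f_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{\mathbf{XY}}(\mathbf{x}, \mathbf{y})\, dy_1 \cdots dy_m.$$

Similarly, the marginal pdf of a reduced vector $\mathbf{X}' \triangleq (X_1, \ldots, X_{n-1})^T$ is obtained from the
pdf of X by

$$f_{\mathbf{X}'}(\mathbf{x}') \triangleq \int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\, dx_n, \quad \text{where } \mathbf{x}' \triangleq (x_1, \ldots, x_{n-1})^T. \tag{5.1-13}$$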


Obviously, Equation 5.1-13 can be extended to all the other marginal pdf’s as well by merely
integrating over the appropriate variable.


† This usage is prevalent in statistical pattern recognition.

Example 5.1-4
(particle at random) Let $\mathbf{X} = (X_1, X_2, X_3)^T$ denote the position of a particle inside a sphere
of radius a centered about the origin. Assume that at the instant of observation, the particle
is equally likely to be anywhere in the sphere, that is,

$$f_{\mathbf{X}}(\mathbf{x}) = \begin{cases} \dfrac{3}{4\pi a^3}, & \sqrt{x_1^2 + x_2^2 + x_3^2} < a, \\ 0, & \text{otherwise.} \end{cases}$$

Compute the probability that the particle lies within a subsphere of radius 2a/3 contained
within the larger sphere.


Solution Let E denote the event that the particle lies within the subsphere (centered at
the origin for simplicity) and let

$$\mathcal{R} \triangleq \left\{x_1, x_2, x_3 : \sqrt{x_1^2 + x_2^2 + x_3^2} < 2a/3\right\}.$$

Then the evaluation of

$$P[E] = \iiint_{\mathcal{R}} f_{\mathbf{X}}(x_1, x_2, x_3)\, dx_1\, dx_2\, dx_3$$

is best done using spherical coordinates, that is,

$$P[E] = \frac{3}{4\pi a^3} \int_{0}^{2\pi} \int_{0}^{\pi} \int_{0}^{2a/3} r^2 \sin\phi\; dr\, d\phi\, d\theta.$$

Note that in this simple case the answer can be obtained directly by noting the ratio of
volumes, that is, $(2a/3)^3 \div a^3 = 8/27 \approx 0.3$.
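The ratio-of-volumes answer in Example 5.1-4 can also be checked by simulation. The sketch below is one possible approach rather than the text's method: it assumes a = 1, draws points uniformly in the sphere by rejection from the enclosing cube, and counts how often a point falls inside the subsphere of radius 2a/3. The function names and the seed are arbitrary.

```python
import random

def sample_in_sphere(rng, a):
    # Rejection sampling: draw from the cube [-a, a]^3, keep points inside the sphere
    while True:
        x = [rng.uniform(-a, a) for _ in range(3)]
        if sum(c * c for c in x) < a * a:
            return x

def estimate_subsphere_probability(a=1.0, trials=100_000, seed=0):
    rng = random.Random(seed)
    r_sub = 2.0 * a / 3.0
    hits = 0
    for _ in range(trials):
        x = sample_in_sphere(rng, a)
        if sum(c * c for c in x) < r_sub * r_sub:
            hits += 1
    return hits / trials

estimate = estimate_subsphere_probability()
print(estimate, 8 / 27)
```

The estimate should land close to 8/27, with a sampling error on the order of 0.001 to 0.002 for this number of trials.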


5.2 MULTIPLE TRANSFORMATION OF RANDOM VARIABLES 


The material in this section is a direct extension of Section 3.4 in Chapter 3. Let X be an
n-dimensional random vector defined on sample space Ω. Then consider the n real functions

$$\begin{aligned} y_1 &= g_1(x_1, x_2, \ldots, x_n) \\ y_2 &= g_2(x_1, x_2, \ldots, x_n) \\ &\;\;\vdots \\ y_n &= g_n(x_1, x_2, \ldots, x_n), \end{aligned} \tag{5.2-1}$$

where the $g_i$, i = 1, ..., n are functionally independent, meaning that there exists no function
$H(y_1, y_2, \ldots, y_n)$ that is identically zero. For example, the three linear functions

$$\begin{aligned} y_1 &= x_1 - 2x_2 + x_3 \\ y_2 &= 3x_1 + 2x_2 + 2x_3 \\ y_3 &= 5x_1 - 2x_2 + 4x_3 \end{aligned} \tag{5.2-2}$$

are not functionally independent because $H(y_1, y_2, y_3) = 2y_1 + y_2 - y_3 = 0$ for all values
of $x_1, x_2, x_3$. We create the vector of n RVs $\mathbf{Y} \triangleq (Y_1, Y_2, \ldots, Y_n)$ according to

$$\begin{aligned} Y_1 &= g_1(X_1, X_2, \ldots, X_n) \\ Y_2 &= g_2(X_1, X_2, \ldots, X_n) \\ &\;\;\vdots \\ Y_n &= g_n(X_1, X_2, \ldots, X_n). \end{aligned} \tag{5.2-3}$$
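For linear functions such as those in Equation 5.2-2, functional dependence shows up as a singular coefficient matrix. The short sketch below (helper names are hypothetical) computes the 3 × 3 determinant of the coefficients and also evaluates $2y_1 + y_2 - y_3$ at an arbitrarily chosen point.

```python
def det3(m):
    # Cofactor expansion of a 3x3 determinant along the first row
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Coefficients of y1, y2, y3 in Equation 5.2-2
A = [[1, -2, 1],
     [3, 2, 2],
     [5, -2, 4]]

def y_of(x):
    # Apply the three linear functions to the point x = (x1, x2, x3)
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

determinant = det3(A)
y1, y2, y3 = y_of([0.7, -1.2, 3.4])   # arbitrary test point
residual = 2 * y1 + y2 - y3

print(determinant, residual)
```

A zero determinant means the map from x to y cannot be inverted uniquely, which is why the development that follows requires functionally independent $g_i$.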
In this way we have generated n functions of n RVs. In order to save on notation, we let
$\mathbf{x} \triangleq (x_1, x_2, \ldots, x_n)$, $\mathbf{y} \triangleq (y_1, y_2, \ldots, y_n)$ and ask: Given the joint pdf $f_{\mathbf{X}}(\mathbf{x})$, how do we
compute the joint pdf of the $Y_i$, i = 1, ..., n, that is, $f_{\mathbf{Y}}(\mathbf{y})$? Note that if we start out with
fewer RVs $Y_i$, say i = 1, ..., m, than the number of $X_i$, say i = 1, ..., n with m < n, we
can add more $Y_i$ by introducing auxiliary functions as we did in Example 3.5-4.
We assume that we can solve the set of Equations 5.2-1 uniquely for the $x_i$, i =
1, ..., n, as

$$\begin{aligned} x_1 &= \phi_1(y_1, y_2, \ldots, y_n) \\ x_2 &= \phi_2(y_1, y_2, \ldots, y_n) \\ &\;\;\vdots \\ x_n &= \phi_n(y_1, y_2, \ldots, y_n). \end{aligned} \tag{5.2-4}$$

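Now consider the infinitesimal event $A \triangleq \{\zeta : y_i < Y_i \le y_i + dy_i,\ i = 1, \ldots, n\}$. Here the
$Y_i$ are restricted to take on values in the infinitesimal rectangular parallelepiped that we
denote by $\mathcal{P}_y$. Following the procedure in Equations 3.4-5 to 3.4-8, we write

$$P[A] = \int_{\mathcal{P}_y} f_{\mathbf{Y}}(\mathbf{y})\, d\mathbf{y} = f_{\mathbf{Y}}(\mathbf{y})\,V_y = \int_{\mathcal{P}_x} f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x} = f_{\mathbf{X}}(\mathbf{x})\,V_x, \tag{5.2-5}$$

where $\mathcal{P}_x$ is an infinitesimal parallelepiped (not necessarily rectangular), $V_y$ is the volume
of $\mathcal{P}_y$, and $V_x$ is the volume of $\mathcal{P}_x$. From Equation 5.2-5 we obtain

$$f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(\mathbf{x})\,\frac{V_x}{V_y}. \tag{5.2-6}$$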

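The ratio of infinitesimal volumes is shown in Appendix C to be the magnitude of the
determinant $\tilde{J}$, given by

$$\tilde{J} = \begin{vmatrix} \dfrac{\partial \phi_1}{\partial y_1} & \cdots & \dfrac{\partial \phi_1}{\partial y_n} \\ \vdots & & \vdots \\ \dfrac{\partial \phi_n}{\partial y_1} & \cdots & \dfrac{\partial \phi_n}{\partial y_n} \end{vmatrix} \tag{5.2-7}$$

$$\phantom{\tilde{J}} = \begin{vmatrix} \dfrac{\partial g_1}{\partial x_1} & \cdots & \dfrac{\partial g_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial g_n}{\partial x_1} & \cdots & \dfrac{\partial g_n}{\partial x_n} \end{vmatrix}^{-1} = J^{-1}. \tag{5.2-8}$$

Hence

$$f_{\mathbf{Y}}(\mathbf{y}) = |\tilde{J}|\, f_{\mathbf{X}}(\mathbf{x}) = f_{\mathbf{X}}(\mathbf{x})/|J|. \tag{5.2-9}$$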


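In general, the infinitesimal rectangular parallelepiped in the y system maps into r disjoint,
infinitesimal parallelepipeds in the x system. Then the event A, as defined above, is the
union of the events $E_i$, i = 1, ..., r, where $E_i \triangleq \{\mathbf{X} \in \mathcal{P}_x^{(i)}\}$ and $\mathcal{P}_x^{(i)}$ is one of the r
parallelepipeds in the x system with volume $V_x^{(i)}$. Since the regions and, therefore, the
events are disjoint, the elementary probabilities $P[E_i]$ add, and we obtain the main result
of this section, that is,

$$f_{\mathbf{Y}}(\mathbf{y}) = \sum_{i=1}^{r} f_{\mathbf{X}}(\mathbf{x}^{(i)})\,|\tilde{J}_i| \tag{5.2-10}$$

$$\phantom{f_{\mathbf{Y}}(\mathbf{y})} = \sum_{i=1}^{r} f_{\mathbf{X}}(\mathbf{x}^{(i)})/|J_i|. \tag{5.2-11}$$

In Equations 5.2-10 and 5.2-11, $|\tilde{J}_i| \triangleq V_x^{(i)}/V_y$ and $|J_i| = |\tilde{J}_i|^{-1}$.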


Example 5.2-1
(vector transformation) We are given three scalar transformations of vector x

$$g_1(\mathbf{x}) = x_1^2 - x_2^2, \qquad g_2(\mathbf{x}) = x_1^2 + x_2^2, \qquad g_3(\mathbf{x}) = x_3.$$

There are four solutions (roots) to the system

$$y_1 = x_1^2 - x_2^2, \qquad y_2 = x_1^2 + x_2^2, \qquad y_3 = x_3.$$

They are

$$\begin{aligned}
x_1^{(1)} &= \left((y_1 + y_2)/2\right)^{1/2}, & x_1^{(2)} &= \left((y_1 + y_2)/2\right)^{1/2}, \\
x_2^{(1)} &= \left((y_2 - y_1)/2\right)^{1/2}, & x_2^{(2)} &= -\left((y_2 - y_1)/2\right)^{1/2}, \\
x_3^{(1)} &= y_3, & x_3^{(2)} &= y_3, \\
x_1^{(3)} &= -\left((y_1 + y_2)/2\right)^{1/2}, & x_1^{(4)} &= -\left((y_1 + y_2)/2\right)^{1/2}, \\
x_2^{(3)} &= \left((y_2 - y_1)/2\right)^{1/2}, & x_2^{(4)} &= -\left((y_2 - y_1)/2\right)^{1/2}, \\
x_3^{(3)} &= y_3, & x_3^{(4)} &= y_3.
\end{aligned} \tag{5.2-12}$$

For the roots to be real, $y_2 > 0$, $y_1 + y_2 > 0$, and $y_2 - y_1 > 0$. Hence $y_2 > |y_1|$. In this
case the single rectangular parallelepiped in the three-dimensional y space maps into four
disjoint, infinitesimal parallelepipeds in three-dimensional x space.

Example 5.2-2
(more vector transformation) For the transformation considered in Example 5.2-1, compute


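$f_{\mathbf{Y}}(\mathbf{y})$ if

$$f_{\mathbf{X}}(\mathbf{x}) = (2\pi)^{-3/2} \exp\left[-\tfrac{1}{2}\left(x_1^2 + x_2^2 + x_3^2\right)\right],$$

i.e., X is a three-dimensional standard Gaussian RV.


Solution We must compute the Jacobian |J| at each of the four roots. The Jacobian is
computed as

$$J = \begin{vmatrix} 2x_1 & -2x_2 & 0 \\ 2x_1 & 2x_2 & 0 \\ 0 & 0 & 1 \end{vmatrix} = 8x_1x_2.$$

For example, at the first root we compute

$$J_1 = 4\left(y_2^2 - y_1^2\right)^{1/2}.$$

A direct calculation shows that $|J_1| = |J_2| = |J_3| = |J_4|$. Finally, labeling the four solutions
in Equation 5.2-12 as $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4$, we obtain

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{4\sqrt{y_2^2 - y_1^2}} \sum_{i=1}^{4} f_{\mathbf{X}}(\mathbf{x}_i) = \frac{(2\pi)^{-3/2}}{\sqrt{y_2^2 - y_1^2}} \exp\left[-\tfrac{1}{2}\left(y_2 + y_3^2\right)\right] \times u(y_2)\,u(y_2 - |y_1|).$$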
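The steps of Examples 5.2-1 and 5.2-2 can be traced numerically. The sketch below assumes the four roots and the Jacobian $J = 8x_1x_2$ as read from the examples, sums $f_{\mathbf{X}}(\mathbf{x}^{(i)})/|J_i|$ over the roots as in Equation 5.2-11, and compares the result with the closed form $(2\pi)^{-3/2}(y_2^2 - y_1^2)^{-1/2}\exp[-\tfrac{1}{2}(y_2 + y_3^2)]$. The function names and the test point are illustrative choices.

```python
import math

def g(x):
    # Forward transformation of Example 5.2-1
    x1, x2, x3 = x
    return (x1**2 - x2**2, x1**2 + x2**2, x3)

def roots(y):
    # The four real roots, valid when y2 > |y1|
    y1, y2, y3 = y
    a = math.sqrt((y1 + y2) / 2.0)
    b = math.sqrt((y2 - y1) / 2.0)
    return [(a, b, y3), (a, -b, y3), (-a, b, y3), (-a, -b, y3)]

def jacobian(x):
    # Determinant of the matrix of partial derivatives dg_i/dx_j
    return 8.0 * x[0] * x[1]

def f_X(x):
    # Three-dimensional standard Gaussian pdf
    return (2.0 * math.pi) ** -1.5 * math.exp(-0.5 * sum(c * c for c in x))

def f_Y(y):
    # Sum over roots of f_X / |J|, zero where no real roots exist
    if y[1] <= abs(y[0]):
        return 0.0
    return sum(f_X(x) / abs(jacobian(x)) for x in roots(y))

def f_Y_closed_form(y):
    y1, y2, y3 = y
    if y2 <= abs(y1):
        return 0.0
    return ((2.0 * math.pi) ** -1.5 / math.sqrt(y2**2 - y1**2)
            * math.exp(-0.5 * (y2 + y3**2)))

y = (0.5, 2.0, -0.3)   # arbitrary point with y2 > |y1|
print(f_Y(y), f_Y_closed_form(y))
```

Each root maps back to the same y under `g`, and all four Jacobian magnitudes coincide, so the sum collapses to four equal terms.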








Although a random vector is completely characterized by its distribution or density function, 
the latter is often hard to come by except for some notable exceptions. By far the two most 
important exceptions are (1) when Fx(x) = Fx, (x1)... Fx„(£n), that is, the n components 
of X are independent, and (2) when X obeys the multidimensional Gaussian law. Case (1) 
is easily handled, since it is a direct extension of the scalar case. Case (2) will be discussed in 
Section 5.6. But what to do when neither case (1) nor (2) applies? Estimating multidimensional
distributions involving dependent variables is often not practical and even if available
might be too complex to be of any real use. Therefore, when we deal with vector RVs, we 
often settle for a less complete but more computable characterization based on moments. 
For most engineering applications, the most important moments are the expectation vector 
(the first moment) and the covariance matrix (a second moment). These quantities and their 
use are discussed later on in the chapter. Next, we consider random vectors with ordered 
components. 


† It would be challenging to show that the pdf $f_{\mathbf{Y}}(\mathbf{y})$ obtained in Example 5.2-2 integrates to unity.


5.3 ORDERED RANDOM VARIABLES


In Section 3.4 (Examples 3.4-3 and 3.4-5) we introduced the notion of two ordered RVs.
Here we generalize to n RVs and obtain some important results regarding these. Ordered
RVs are quite important because in the absence of any information about the distribution
of the RVs, the statistics of the ordering transformation can give us significant information
about such parameters as the median, range, and others that are closely related to the
parameters of distributions. Consider n i.i.d. continuous RVs each with pdf $f_X(x)$, where
$-\infty < x < \infty$. The joint pdf of all n RVs is $f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = f_X(x_1) \cdots f_X(x_n)$
and the joint marginal density of, say, $X_1$ and $X_n$ is obtained by integrating out with
respect to $x_2, \ldots, x_{n-1}$. Now arrange the n RVs in order of increasing size; that is, if
$X_k = \min(X_1, \ldots, X_n)$ then $Y_1 = X_k$, and $Y_2$ is the next smallest of the $\{X_i, i = 1, \ldots, n, i \ne k\}$,
and $Y_3$ is the next smallest after that until, finally, $Y_n = \max(X_1, \ldots, X_n)$. We thus
have performed an ordering transformation, and we can write that the strict inequalities
$Y_1 < Y_2 < \cdots < Y_{n-1} < Y_n$ occur with probability 1, since the $X_i$ are assumed continuous
RVs. We wish to find the joint pdf of the $\{Y_i, i = 1, \ldots, n\}$. At first glance we might argue,
incorrectly, that since the set $S_1 = \{X_i, i = 1, \ldots, n\}$ contains the same elements as the set
$S_2 = \{Y_i, i = 1, \ldots, n\}$,

$$f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = f_{X_1 \cdots X_n}(y_1, \ldots, y_n) = f_{X_1}(y_1) \cdots f_{X_n}(y_n)$$

for $\{y_i : -\infty < y_i < \infty, i = 1, \ldots, n\}$. However, this result ignores the fact that the $\{Y_i, i = 1, \ldots, n\}$
are not independent random variables. For example, if you have observed $X_1$, what
have you learned about $X_2$ from observing $X_1$? Nothing, it turns out, but if you are given
$Y_1$, you know right away that $Y_2 > Y_1$ and you also know that the probability that $Y_1 > Y_2$
is zero. Hence there is no probability mass in the region $y_1 > y_2$. With this in mind we
might want to modify the joint pdf’s of the $\{Y_i\}$ to

$$f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = \begin{cases} f_{X_1}(y_1) \cdots f_{X_n}(y_n), & \text{for } y_1 < y_2 < \cdots < y_n, \\ 0, & \text{else.} \end{cases}$$

However, now we have another problem: The volume enclosed by the modified joint pdf is 
not unity. Indeed for n large it could be substantially smaller than unity. To get the correct 
joint pdf for the {Y;}, we shall use the results of Section 5.2, which allow us to compute the 
pdf of one set of RVs that are functionally related to another set whose pdf we already know. 


We begin by partitioning the n-dimensional space $(-\infty < x_1, x_2, \ldots, x_n < \infty)$ into n!
nonoverlapping, distinct regions described by $\mathcal{R}_i = \{x_{i(1)} < x_{i(2)} < \cdots < x_{i(j)} < \cdots < x_{i(n)}\}$
for $1 \le i \le n!$, $1 \le j \le n$, and $x_{i(j)} \in \{x_1, x_2, \ldots, x_n\}$. Note that $x_{i(j)} < x_{i(k)}$ for j < k.

Each region will have a different size-ordering of its elements. For example, consider 3-space
$(x_1, x_2, x_3)$. Then a distinct, nonoverlapping partition is

$\mathcal{R}_1 = (x_1 < x_2 < x_3)$
$\mathcal{R}_2 = (x_1 < x_3 < x_2)$
$\mathcal{R}_3 = (x_2 < x_1 < x_3)$
$\mathcal{R}_4 = (x_2 < x_3 < x_1)$
$\mathcal{R}_5 = (x_3 < x_1 < x_2)$
$\mathcal{R}_6 = (x_3 < x_2 < x_1)$.

For each of the n! regions we define $y_1 \triangleq x_{j(1)} < y_2 \triangleq x_{j(2)} < \cdots < y_n \triangleq x_{j(n)}$, j = 1, ..., n!.
For example, in 3-space $(x_1, x_2, x_3)$ we have

for $\mathcal{R}_1$: $y_1 = x_1$; $y_2 = x_2$; $y_3 = x_3$
for $\mathcal{R}_2$: $y_1 = x_1$; $y_2 = x_3$; $y_3 = x_2$
for $\mathcal{R}_3$: $y_1 = x_2$; $y_2 = x_1$; $y_3 = x_3$
for $\mathcal{R}_4$: $y_1 = x_2$; $y_2 = x_3$; $y_3 = x_1$
for $\mathcal{R}_5$: $y_1 = x_3$; $y_2 = x_1$; $y_3 = x_2$
for $\mathcal{R}_6$: $y_1 = x_3$; $y_2 = x_2$; $y_3 = x_1$.

Thus in 3-space $(x_1, x_2, x_3)$, there are six sets of transformation equations and 6 = 3! distinct
solutions for $y_1 < y_2 < y_3$; there are no solutions in y-space otherwise:

in $\mathcal{R}_1$: $y_1 = g_1(x_1, x_2, x_3) = x_1$; $x_1^{(1)} = \phi_1(y_1, y_2, y_3) = y_1$
$y_2 = h_1(x_1, x_2, x_3) = x_2$; $x_2^{(1)} = \psi_1(y_1, y_2, y_3) = y_2$
$y_3 = q_1(x_1, x_2, x_3) = x_3$; $x_3^{(1)} = \theta_1(y_1, y_2, y_3) = y_3$

in $\mathcal{R}_2$: $y_1 = g_2(x_1, x_2, x_3) = x_1$; $x_1^{(2)} = \phi_2(y_1, y_2, y_3) = y_1$
$y_2 = h_2(x_1, x_2, x_3) = x_3$; $x_3^{(2)} = \psi_2(y_1, y_2, y_3) = y_2$
$y_3 = q_2(x_1, x_2, x_3) = x_2$; $x_2^{(2)} = \theta_2(y_1, y_2, y_3) = y_3$

in $\mathcal{R}_3$: $y_1 = g_3(x_1, x_2, x_3) = x_2$; $x_2^{(3)} = \phi_3(y_1, y_2, y_3) = y_1$
$y_2 = h_3(x_1, x_2, x_3) = x_1$; $x_1^{(3)} = \psi_3(y_1, y_2, y_3) = y_2$
$y_3 = q_3(x_1, x_2, x_3) = x_3$; $x_3^{(3)} = \theta_3(y_1, y_2, y_3) = y_3$

in $\mathcal{R}_4$: $y_1 = g_4(x_1, x_2, x_3) = x_2$; $x_2^{(4)} = \phi_4(y_1, y_2, y_3) = y_1$
$y_2 = h_4(x_1, x_2, x_3) = x_3$; $x_3^{(4)} = \psi_4(y_1, y_2, y_3) = y_2$
$y_3 = q_4(x_1, x_2, x_3) = x_1$; $x_1^{(4)} = \theta_4(y_1, y_2, y_3) = y_3$

in $\mathcal{R}_5$: $y_1 = g_5(x_1, x_2, x_3) = x_3$; $x_3^{(5)} = \phi_5(y_1, y_2, y_3) = y_1$
$y_2 = h_5(x_1, x_2, x_3) = x_1$; $x_1^{(5)} = \psi_5(y_1, y_2, y_3) = y_2$
$y_3 = q_5(x_1, x_2, x_3) = x_2$; $x_2^{(5)} = \theta_5(y_1, y_2, y_3) = y_3$

in $\mathcal{R}_6$: $y_1 = g_6(x_1, x_2, x_3) = x_3$; $x_3^{(6)} = \phi_6(y_1, y_2, y_3) = y_1$
$y_2 = h_6(x_1, x_2, x_3) = x_2$; $x_2^{(6)} = \psi_6(y_1, y_2, y_3) = y_2$
$y_3 = q_6(x_1, x_2, x_3) = x_1$; $x_1^{(6)} = \theta_6(y_1, y_2, y_3) = y_3$.

The magnitude of the Jacobian of each of these transformations is unity so that Equation
5.2-11, specialized here (in slightly different notation) for three ordered i.i.d. RVs, yields

$$f_{Y_1Y_2Y_3}(y_1, y_2, y_3) = \sum_{m=1}^{3!} \frac{f_{X_1X_2X_3}\left(x_1^{(m)}, x_2^{(m)}, x_3^{(m)}\right)}{|J_m|} = \sum_{m=1}^{3!} f_X\left(x_1^{(m)}\right) f_X\left(x_2^{(m)}\right) f_X\left(x_3^{(m)}\right).$$

Finally, expanding the summation and inserting the appropriate solutions, we obtain

$$\begin{aligned} f_{Y_1Y_2Y_3}(y_1, y_2, y_3) &= f_X(y_1)f_X(y_2)f_X(y_3) + f_X(y_1)f_X(y_3)f_X(y_2) + f_X(y_2)f_X(y_1)f_X(y_3) \\ &\quad + f_X(y_2)f_X(y_3)f_X(y_1) + f_X(y_3)f_X(y_1)f_X(y_2) + f_X(y_3)f_X(y_2)f_X(y_1) \\ &= 3!\, f_X(y_1) f_X(y_2) f_X(y_3). \end{aligned}$$

This result applies when $y_1 < y_2 < y_3$; otherwise $f_{Y_1Y_2Y_3}(y_1, y_2, y_3) = 0$.
We now summarize the result for the general case. We are given n continuous i.i.d. RVs
with pdf $f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)$ with $-\infty < x_1, x_2, \ldots, x_n < \infty$ and consider
the transformation that orders them by signed magnitude so that $Y_1 < Y_2 < \cdots < Y_n$,
where for i = 1, ..., n, $Y_i \in \{X_1, X_2, \ldots, X_n\}$. Then

$$f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n) = \begin{cases} n! \prod_{i=1}^{n} f_X(y_i), & \text{for } -\infty < y_1 < y_2 < \cdots < y_n < \infty, \\ 0, & \text{else.} \end{cases} \tag{5.3-1}$$

Figure 5.3-1 Showing integration regions for two ordered random variables. (Figure not reproduced; its labels indicate $dy_1$ integrated from −∞ to $y_2$ and $dy_2$ integrated from −∞ to ∞.)


If $f_{Y_1 \cdots Y_n}(y_1, \ldots, y_n)$ is a true pdf, it must integrate out to 1. This requires an n-fold iterated
integration.
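Before working through the iterated integral, the normalization can be checked by simulation. The sketch below assumes, for illustration only, that X is uniform on (0, 1), so the ordered joint pdf equals n! on the region $0 < y_1 < \cdots < y_n < 1$ and zero elsewhere; it then averages that pdf over random points of the unit cube. Function names and the seed are arbitrary.

```python
import math
import random

def ordered_joint_pdf(y, f_X):
    # n! times the product of f_X(y_i) on y_1 < ... < y_n, zero elsewhere
    if any(a >= b for a, b in zip(y, y[1:])):
        return 0.0
    return math.factorial(len(y)) * math.prod(f_X(v) for v in y)

def uniform_pdf(x):
    # Assumed example: X uniform on (0, 1)
    return 1.0 if 0.0 < x < 1.0 else 0.0

def monte_carlo_integral(n, trials=200_000, seed=1):
    # Average of the pdf over uniform points in the unit cube (volume 1)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = [rng.random() for _ in range(n)]
        total += ordered_joint_pdf(y, uniform_pdf)
    return total / trials

inside = ordered_joint_pdf([0.1, 0.4, 0.8], uniform_pdf)
outside = ordered_joint_pdf([0.4, 0.1, 0.8], uniform_pdf)
integral_estimate = monte_carlo_integral(3)
print(inside, outside, integral_estimate)
```

Only about one point in n! falls in the ordered region, which is exactly why the factor n! is needed for the pdf to integrate to one.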

To show how this integration is done we consider the n = 2 case. Then integrating
the function $f_{Y_1Y_2}(y_1, y_2) = 2f_X(y_1)f_X(y_2)$ over the region $-\infty < y_1 < y_2 < \infty$ requires
integrating the integrand over $-\infty < y_1 < y_2$ followed by an integration over $-\infty < y_2 < \infty$.
This is shown in Figure 5.3-1. Since $-\infty < y_1 < y_2$ we integrate the $y_1$ variable from −∞
to $y_2$; then we complete the integration over the half-space by integrating the $y_2$ variable
from −∞ to ∞.
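Written out for n = 2, and using only the facts that $F_X$ is the CDF of $f_X$ and runs from 0 to 1, the two steps are (the substitution $u = F_X(y_2)$ is introduced here for convenience):

$$\int_{-\infty}^{\infty} \int_{-\infty}^{y_2} 2 f_X(y_1) f_X(y_2)\, dy_1\, dy_2 = 2\int_{-\infty}^{\infty} F_X(y_2)\, f_X(y_2)\, dy_2 = 2\int_{0}^{1} u\, du = 1.$$

The inner integral produces $F_X(y_2)$, and the outer integral is then an integral of $F_X$ against its own differential.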

The extension to the n-dimensional case is straightforward: We integrate the $y_1$ variable
first from −∞ to $y_2$; next the $y_2$ variable from −∞ to $y_3$, etc.; finally the $y_n$ variable gets
integrated from −∞ to ∞. In this fashion we have integrated over the entire subspace
$-\infty < y_1 < \cdots < y_n < \infty$. The last integration yields

$$\frac{n!}{(n-1)!} \int_{-\infty}^{\infty} F_X^{\,n-1}(y_n)\, f_X(y_n)\, dy_n = \frac{n!}{(n-1)!} \int_{-\infty}^{\infty} F_X^{\,n-1}(y_n)\, dF_X(y_n) = F_X^{\,n}(y_n)\Big|_{-\infty}^{\infty} = 1.$$
The next development leads to the fundamental result of order statistics. 


Distribution of area random variables 


We begin by defining the area RVs

$$Z_i \triangleq \int_{-\infty}^{Y_i} f_X(x)\, dx, \quad i = 1, \ldots, n, \tag{5.3-2}$$

where $f_X(x)$ is the pdf of a continuous RV X, and $X_1, \ldots, X_n$ are n i.i.d. observations on
X. After ordering we obtain the $Y_i$, i = 1, ..., n, as the ordered RVs where $\min(X_1, \ldots, X_n) \triangleq
Y_1 < Y_2 < \cdots < Y_n \triangleq \max(X_1, \ldots, X_n)$. We denote $Z_i$ somewhat informally as an “area RV”
because the RV $Z_i$ is the area under $f_X(x)$ up to $Y_i$. Clearly, because $Y_i$ is an RV so is $Z_i$.
Indeed, we can think of $Z_i$ as a CDF with a random argument, hence we may also speak of
it as a random CDF. We recognize that $Z_1 < \cdots < Z_n$ because $Y_1 < \cdots < Y_n$ and $Z_i$ is a
monotonically increasing function of $Y_i$ for every index i. We consider the transformation

$$z_i = \int_{-\infty}^{y_i} f_X(x)\, dx = F_X(y_i), \quad i = 1, \ldots, n,$$

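where $F_X(x)$ is a continuously increasing function of x, and hence has a unique inverse at
every x. The roots of these equations are $y_i^{(r)} = F_X^{-1}(z_i)$, i = 1, ..., n (see Figure 5.3-2), and
the Jacobian is

$$\begin{vmatrix} \dfrac{\partial z_1}{\partial y_1} & \cdots & \dfrac{\partial z_1}{\partial y_n} \\ \vdots & & \vdots \\ \dfrac{\partial z_n}{\partial y_1} & \cdots & \dfrac{\partial z_n}{\partial y_n} \end{vmatrix} = \begin{vmatrix} f_X(y_1^{(r)}) & 0 & \cdots & 0 \\ 0 & f_X(y_2^{(r)}) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & f_X(y_n^{(r)}) \end{vmatrix} = \prod_{i=1}^{n} f_X(y_i^{(r)}). \tag{5.3-3}$$

Hence the pdf of the $Z_i$, i = 1, ..., n, is determined as

$$f_{Z_1 \cdots Z_n}(z_1, \ldots, z_n) = \begin{cases} n!\, \dfrac{\prod_{i=1}^{n} f_X(y_i^{(r)})}{\prod_{i=1}^{n} f_X(y_i^{(r)})} = n!, & 0 < z_1 < z_2 < \cdots < z_n < 1, \\ 0, & \text{else.} \end{cases} \tag{5.3-4}$$

Figure 5.3-2 Finding the roots of the transformation $z = F_X(y)$. (Figure not reproduced.)

This non-intuitive result says that the pdf of the $Z_i$, i = 1, ..., n, does not depend on the
underlying pdf $f_X(x)$. Equation 5.3-4 enables us to derive a number of important results
useful in estimating various parameters when we don’t know the underlying distributions.
See, for example, Example 5.3-5.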
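The claim that the area RVs do not depend on the underlying pdf can be illustrated by simulation. The sketch below draws ordered samples of size n = 3 from two different distributions chosen for illustration (a standard Gaussian and an exponential with rate 2), maps each ordered value through the corresponding CDF, and averages the resulting $Z_i$. Since a constant pdf n! on the ordered unit cube is the pdf of ordered uniform (0, 1) variables, the averages should be near the standard values i/(n + 1). Function names, sample sizes, and the seed are arbitrary.

```python
import math
import random

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exponential_cdf(x, rate=2.0):
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def mean_area_rvs(draw, cdf, n, trials, rng):
    # Average Z_i = F_X(Y_i) over many ordered samples of size n
    sums = [0.0] * n
    for _ in range(trials):
        y = sorted(draw(rng) for _ in range(n))
        for i, v in enumerate(y):
            sums[i] += cdf(v)
    return [s / trials for s in sums]

rng = random.Random(2)
gauss_means = mean_area_rvs(lambda r: r.gauss(0.0, 1.0), normal_cdf, 3, 50_000, rng)
expo_means = mean_area_rvs(lambda r: r.expovariate(2.0), exponential_cdf, 3, 50_000, rng)
print(gauss_means)
print(expo_means)
```

Both lists should be close to 1/4, 1/2, 3/4 even though the two underlying densities look nothing alike.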


Example 5.3-1
(area under $f_X(x)$ between the smallest and largest observations) We wish to compute
the area under $f_X(x)$ between the smallest, $Y_1 = \min(X_1, \ldots, X_n)$, and largest, $Y_n =
\max(X_1, \ldots, X_n)$, of the observations in a sample of size n drawn from the pdf $f_X(x)$. We
denote this area with the new random variable

$$V_{1n} \triangleq \int_{Y_1}^{Y_n} f_X(x)\, dx. \tag{5.3-5}$$

We note that

$$V_{1n} = \int_{-\infty}^{Y_n} f_X(x)\, dx - \int_{-\infty}^{Y_1} f_X(x)\, dx = Z_n - Z_1, \tag{5.3-6}$$

hence we need to compute $f_{Z_1Z_n}(z_1, z_n)$ from $f_{Z_1 \cdots Z_n}(z_1, \ldots, z_n)$. This requires integrating
Equation 5.3-4 over $z_2, z_3, \ldots, z_{n-1}$, recalling that $z_{i-1} < z_i < 1$. The result is

$$f_{Z_1Z_n}(z_1, z_n) = \begin{cases} n(n-1)(z_n - z_1)^{n-2}, & \text{for } 0 < z_1 < z_n < 1,\ n \ge 2, \\ 0, & \text{else.} \end{cases} \tag{5.3-7}$$


Consider now two new RVs $V_{1n} \triangleq Z_n - Z_1$, $W \triangleq Z_n$. To find the pdf $f_{VW}(v, w)$, we
consider the transformation $v = z_n - z_1$, $w = z_n$; 0 < v < w < 1. The Jacobian magnitude of
this transformation is 1 and the only solution to this transformation is $z_1^{(r)} = w - v$; $z_n^{(r)} = w$.
Hence

$$f_{VW}(v, w) = \begin{cases} n(n-1)v^{n-2}, & \text{for } 0 < w - v < w < 1,\ n \ge 2, \\ 0, & \text{else.} \end{cases}$$

To get the pdf of $V_{1n}$ alone, we integrate out with respect to w. To help with the integration,
we note that the two inequalities w − v > 0 and w < 1 suggest the triangular region of
integration shown in Figure 5.3-3.

Figure 5.3-3 Region of integration for computing the probability density function of $V_{1n}$. (Figure not reproduced; the region is bounded by the lines w = v and w = 1.)

Figure 5.3-4 The beta CDF (Equation 5.3-9) for n = 2 (top curve); n = 4 (middle curve); n = 10
(bottom curve). (Figure not reproduced.)

Thus starting with $f_{V_{1n}}(v) = n(n-1)v^{n-2} \int_{v}^{1} dw$ we obtain

$$f_{V_{1n}}(v) = \begin{cases} n(n-1)v^{n-2}(1 - v), & \text{for } 0 < v < 1,\ n \ge 2, \\ 0, & \text{else.} \end{cases} \tag{5.3-8}$$

This pdf is a special case of the beta density given in Section 2.4 with α = n − 2, β = 1.
The distribution function is the probability that the area spread between the largest and
the smallest is less than or equal to v. It is readily computed as

$$F_{V_{1n}}(v) = \begin{cases} nv^{n-1} - (n-1)v^{n}, & 0 \le v \le 1, \\ 1, & v > 1, \\ 0, & v < 0. \end{cases} \tag{5.3-9}$$

The beta CDF is shown in Figure 5.3-4 for various values of n.
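The CDF of the area between the smallest and largest observations, $nv^{n-1} - (n-1)v^n$ on [0, 1], can be compared with an empirical estimate. The sketch below assumes standard Gaussian samples purely as an example (the result should not depend on that choice), computes the area between the sample extremes through the Gaussian CDF, and counts how often it is at most v. Function names, n = 4, v = 0.5, and the seed are illustrative.

```python
import math
import random

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def coverage_cdf(v, n):
    # CDF of the area between the smallest and largest of n observations
    if v < 0.0:
        return 0.0
    if v > 1.0:
        return 1.0
    return n * v ** (n - 1) - (n - 1) * v ** n

def empirical_coverage_cdf(v, n, trials=100_000, seed=3):
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        area = normal_cdf(max(sample)) - normal_cdf(min(sample))
        if area <= v:
            count += 1
    return count / trials

theory = coverage_cdf(0.5, 4)
empirical = empirical_coverage_cdf(0.5, 4)
print(theory, empirical)
```

For n = 4 and v = 0.5 the formula gives 4(0.125) − 3(0.0625) = 0.3125, and the empirical fraction should fall close to it.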


Example 5.3-2 —— ~ āo 
(area between any ordered RVs) We can extend the above results to computing the density 
of the areas under fx(z) between any ordered RVs, not necessarily between the first and 
the last. We generalize the notation slightly so that 


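$$V_{lm} \triangleq Z_m - Z_l = \int_{-\infty}^{Y_m} f_X(x)\,dx - \int_{-\infty}^{Y_l} f_X(x)\,dx = \int_{Y_l}^{Y_m} f_X(x)\,dx, \quad m > l. \tag{5.3-10}$$

Consider $0 < Z_1 < Z_2 < Z_3 < 1$ with $f_{Z_1Z_2Z_3}(z_1, z_2, z_3) = 3!$, $0 < z_1 < z_2 < z_3 < 1$. We first consider the density, $f_{V_{23}}(v)$, of the RV $V_{23} = Z_3 - Z_2$. Since this involves only $Z_2$ and $Z_3$, we must compute $f_{Z_2Z_3}(z_2, z_3)$ from $f_{Z_1Z_2Z_3}(z_1, z_2, z_3)$. This is done as

$$f_{Z_2Z_3}(z_2, z_3) = \begin{cases} 3!\int_0^{z_2} dz_1 = 3!\,z_2, & 0 < z_2 < z_3 < 1 \\ 0, & \text{else.} \end{cases}$$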
To compute $f_{V_{23}}(v)$ from $f_{Z_2Z_3}(z_2, z_3)$, we define an auxiliary RV $B \triangleq Z_2$ with realizations $\beta$ and an appropriate set of functional equations. In this case a suitable set of functional equations is $v \triangleq z_3 - z_2$, $\beta \triangleq z_2$ with roots $z_2^{(r)} = \beta$, $z_3^{(r)} = v + \beta$. The reader will recognize that $\beta \triangleq z_2$ serves as the auxiliary variable. Then, using the so-called direct formula yields

$$f_{BV_{23}}(\beta, v) = f_{Z_2Z_3}(\beta, v + \beta)/|J| = \begin{cases} 3!\,\beta, & 0 < \beta < 1 - v \\ 0, & \text{else} \end{cases}$$

as the Jacobian magnitude $|J|$ of the transformation is unity.
Finally, integrating over the auxiliary variable $\beta$ yields

$$f_{V_{23}}(v) = \begin{cases} 3!\int_0^{1-v} \beta\,d\beta = 3(1-v)^2, & 0 < v < 1 \\ 0, & \text{else.} \end{cases} \tag{5.3-11}$$


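To compute $f_{V_{12}}(v)$ from $f_{Z_1Z_2Z_3}(z_1, z_2, z_3)$, we proceed in the same fashion. Here we find that $f_{Z_1Z_2}(z_1, z_2)$ is given by $f_{Z_1Z_2}(z_1, z_2) = 3!(1 - z_2)$ for $0 < z_1 < z_2 < 1$ and 0 else. Then, using the transformation $v \triangleq z_2 - z_1$, $\beta \triangleq z_1$ we get the result

$$f_{V_{12}}(v) = \begin{cases} 3!\int_0^{1-v} (1 - v - \beta)\,d\beta = 3(1-v)^2, & 0 < v < 1 \\ 0, & \text{else.} \end{cases} \tag{5.3-12}$$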


We leave the details to the reader. 

The general case is given by the following: let $V_{lm}$ denote the probability area under $f_X(x)$ between $Y_l$ and $Y_m$ of the samples ordered by size $Y_1, Y_2, \ldots, Y_n$ drawn from the pdf $f_X(x)$. Then the pdf of $V_{lm}$ is given by

$$f_{V_{lm}}(v) = \begin{cases} \dfrac{n!}{(m-l-1)!\,(n-m+l)!}\, v^{m-l-1}(1-v)^{n-m+l}, & 0 < v < 1 \\ 0, & \text{else.} \end{cases} \tag{5.3-13}$$
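A short numerical sketch can confirm that Equation 5.3-13 reduces to the special cases derived above and integrates to one. The helper names and the midpoint-rule integrator are illustrative assumptions, not part of the text.

```python
from math import factorial

def area_pdf(v, l, m, n):
    """pdf of V_lm (Equation 5.3-13) for 1 <= l < m <= n."""
    if not (1 <= l < m <= n):
        raise ValueError("need 1 <= l < m <= n")
    if not (0.0 < v < 1.0):
        return 0.0
    coeff = factorial(n) / (factorial(m - l - 1) * factorial(n - m + l))
    return coeff * v ** (m - l - 1) * (1.0 - v) ** (n - m + l)

def integrate(f, a=0.0, b=1.0, steps=2000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

# Special cases for n = 3: V_23 and V_12 both give 3(1 - v)^2.
print(area_pdf(0.3, 2, 3, 3), area_pdf(0.3, 1, 2, 3), 3 * (1 - 0.3) ** 2)
# Total probability for an arbitrary choice l = 2, m = 5, n = 8.
print(integrate(lambda v: area_pdf(v, 2, 5, 8)))
```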


Example 5.3-3
(expected value of area under $f_X(x)$ between ordered samples) Consider the area RVs $0 < Z_1 < Z_2 < Z_3 < 1$, where $Z_i$ is given in Equation 5.3-2. We wish to compute $E[Z_i]$ for $i = 1, 2, 3$, expecting that $E[Z_1] < E[Z_2] < E[Z_3]$. We find that the marginal pdf's $f_{Z_i}(z)$, $i = 1, 2, 3$, are computed as

$$f_{Z_1}(z_1) = \int_{z_1}^{1}\int_{z_1}^{z_3} f_{Z_1Z_2Z_3}(z_1, z_2, z_3)\,dz_2\,dz_3 = 3(1 - z_1)^2$$
$$f_{Z_2}(z_2) = \int_{z_2}^{1}\int_{0}^{z_2} f_{Z_1Z_2Z_3}(z_1, z_2, z_3)\,dz_1\,dz_3 = 3!\,z_2(1 - z_2)$$
$$f_{Z_3}(z_3) = \int_{0}^{z_3}\int_{0}^{z_2} f_{Z_1Z_2Z_3}(z_1, z_2, z_3)\,dz_1\,dz_2 = 3z_3^2.$$


From these results it follows that 


$$E[Z_1] = \int_0^1 z f_{Z_1}(z)\,dz = \frac{1}{4} = \frac{1}{3+1}$$
$$E[Z_2] = \int_0^1 z f_{Z_2}(z)\,dz = \frac{2}{4} = \frac{2}{3+1}$$
$$E[Z_3] = \int_0^1 z f_{Z_3}(z)\,dz = \frac{3}{4} = \frac{3}{3+1},$$

which suggests that in the general case, that is $0 < Z_1 < Z_2 < \cdots < Z_n < 1$,

$$E[Z_i] = \frac{i}{n+1}. \tag{5.3-14}$$


The general case can be obtained by induction. 
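Equation 5.3-14 can also be checked informally by simulation. The sketch below assumes, as before, that the area RVs $Z_i$ can be generated as sorted uniform (0, 1) samples; the function name, trial count, and seed are illustrative.

```python
import random

def mean_ordered_areas(n, trials=20000, seed=7):
    """Average each ordered area Z_i over many trials of n sorted uniforms."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(trials):
        z = sorted(rng.random() for _ in range(n))
        for i, zi in enumerate(z):
            totals[i] += zi
    return [t / trials for t in totals]

n = 3
estimates = mean_ordered_areas(n)
for i, est in enumerate(estimates, start=1):
    # Compare the simulated mean with i / (n + 1) from Equation 5.3-14.
    print(i, round(est, 3), i / (n + 1))
```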


Example 5.3-4
(moments of area between ordered samples) Consider the area between two adjacent ordered samples. This is given by $V_{i,i+1} = Z_{i+1} - Z_i$. The pdf of $V_{i,i+1}$ is given by Equation 5.3-13 by letting $m = i + 1$, $l = i$, which yields $f_{V_{i,i+1}}(v) = n(1-v)^{n-1}$, for $0 < v < 1$ and 0, else. Note that this result is independent of $i$. From this we compute

$$E[V_{i,i+1}] = n\int_0^1 v(1-v)^{n-1}\,dv = n\,\frac{\Gamma(2)\Gamma(n)}{\Gamma(n+2)} = \frac{1}{n+1},$$


where the gamma function [(j) = (j — 1)! for j = 1,2,... and use was made of tables of 
integrals (see for example formula 497, p.67 in A Short Table of Integrals by B. O. Peirce 
and R. M. Foster, Ginn and Company, New York, 1956). The integral can also be found 
online at several places, including www.wolframalpha.com (type ’integral’ at the prompt). 
Likewise 


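$$E[V_{i,i+1}^2] = n\int_0^1 v^2(1-v)^{n-1}\,dv = n\,\frac{\Gamma(3)\Gamma(n)}{\Gamma(n+3)} = n\,\frac{2(n-1)!}{(n+2)!} = \frac{2}{(n+1)(n+2)}.$$

Hence $\sigma^2_{V_{i,i+1}} = \dfrac{2}{(n+1)(n+2)} - \dfrac{1}{(n+1)^2} \approx \dfrac{1}{(n+1)^2}$ for $n \gg 1$.
To compute the variances $\sigma^2_{Z_i}$, $i = 1, \ldots, n$, we first compute $E[Z_i^2]$ for $i = 1, \ldots, n$ as

$$E[Z_1^2] = n!\int_0^1\int_0^{z_n}\cdots\int_0^{z_2} z_1^2\,dz_1\cdots dz_{n-1}\,dz_n = \frac{2}{(n+2)(n+1)}$$
$$E[Z_2^2] = n!\int_0^1\int_0^{z_n}\cdots\int_0^{z_3} z_2^2\int_0^{z_2} dz_1\,dz_2\cdots dz_n = \frac{6}{(n+2)(n+1)}$$
$$\vdots$$
$$E[Z_n^2] = n!\int_0^1 z_n^2\int_0^{z_n}\cdots\int_0^{z_2} dz_1\,dz_2\cdots dz_n = \frac{n(n+1)}{(n+2)(n+1)}.$$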


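It follows that

$$E[Z_i^2] = \frac{i(i+1)}{(n+1)(n+2)} \quad \text{for } i = 1, \ldots, n$$

and the variances, computed as $\sigma^2_{Z_i} = E[Z_i^2] - E^2[Z_i]$, yield

$$\sigma^2_{Z_i} = \frac{i(i+1)}{(n+1)(n+2)} - \frac{i^2}{(n+1)^2} \approx \frac{i}{(n+1)^2} \quad \text{for } n \gg 1.$$

Thus for large $n$, $\sigma^2_{Z_i} \approx E[Z_i]/(n+1)$.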
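The variance formula can be examined with exact rational arithmetic. The sketch below uses the moments stated above; the closed form $i(n+1-i)/[(n+1)^2(n+2)]$ in the comment is obtained here by simplifying their difference and is not quoted from the text. The output suggests that the large-$n$ approximation is accurate when $i$ is small relative to $n$.

```python
from fractions import Fraction

def mean_area(i, n):
    """E[Z_i] from Equation 5.3-14."""
    return Fraction(i, n + 1)

def second_moment_area(i, n):
    """E[Z_i^2] as stated in Example 5.3-4."""
    return Fraction(i * (i + 1), (n + 1) * (n + 2))

def variance_area(i, n):
    """Exact variance; simplifies to i(n+1-i) / ((n+1)^2 (n+2))."""
    return second_moment_area(i, n) - mean_area(i, n) ** 2

n = 100
for i in (1, 2, 50):
    exact = variance_area(i, n)
    approx = Fraction(i, (n + 1) ** 2)  # large-n approximation from the text
    print(i, float(exact), float(approx))
```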


Example 5.3-5
(Estimating range of boot sizes) Military boots need to be ordered for fresh Army recruits but the manufacturer needs to know what range of boot sizes will be required. It is suggested that a random sample of n recruits be measured for required boot size. What is the minimum value of n that will cover at least 95 percent of the boot-size needs of the recruits?


Solution Let $\{X_i, i = 1, \ldots, n\}$ denote the i.i.d. set of boot sizes of the n recruits drawn from a population with (unknown) pdf $f_X(x)$ and let $\{Y_i, i = 1, \ldots, n\}$ denote the order statistics of the observations. With $V_{1n} = \int_{Y_1}^{Y_n} f_X(x)\,dx$ we need to solve $P[V_{1n} \ge 0.95] = \delta$, where $\delta$ is a measure of the reliability of our estimate of n, that is, in $100\delta$ percent of the time, the number n will indeed be the minimum sample size required for estimating the boot needs of the recruits. Using $P[V_{1n} \le 0.95] = 1 - \delta$ and Equation 5.3-9 we compute n = 93 for $\delta$ = 0.95 and n = 114 for $\delta$ = 0.98. The solution is obtained numerically using Excel™. Note that the result is independent of the size of the recruit army.
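The numerical step can be sketched in a few lines instead of a spreadsheet. The search below returns the smallest n for which the CDF of Equation 5.3-9 at the coverage level does not exceed $1 - \delta$; the function names and the search limit are assumptions made for illustration. With $\delta$ = 0.95 it reproduces n = 93. For other values of $\delta$ the strict inequality used here may give a value that differs by one from a solution that rounds the CDF.

```python
def range_area_cdf(v, n):
    """CDF of V_1n (Equation 5.3-9) for 0 <= v <= 1."""
    return n * v ** (n - 1) - (n - 1) * v ** n

def min_sample_size(coverage, delta, n_max=10000):
    """Smallest n with P[V_1n <= coverage] <= 1 - delta (linear search)."""
    for n in range(2, n_max + 1):
        if range_area_cdf(coverage, n) <= 1.0 - delta:
            return n
    raise ValueError("no sample size found up to n_max")

print(min_sample_size(0.95, 0.95))
```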





5.4 EXPECTATION VECTORS AND COVARIANCE MATRICES†


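Definition 5.4-1 The expected value of the (column) vector $\mathbf{X} = (X_1, \ldots, X_n)^T$ is a vector $\boldsymbol{\mu}$ (or $\bar{\mathbf{X}}$) whose elements $\mu_1, \ldots, \mu_n$ are given by

$$\mu_i \triangleq \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x_i f_{\mathbf{X}}(x_1, \ldots, x_n)\,dx_1 \ldots dx_n. \tag{5.4-1}$$

Equivalently with

$$f_{X_i}(x_i) \triangleq \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\,dx_1 \ldots dx_{i-1}\,dx_{i+1} \ldots dx_n$$

the marginal pdf of $X_i$, we can write

$$\mu_i = \int_{-\infty}^{\infty} x_i f_{X_i}(x_i)\,dx_i, \quad i = 1, \ldots, n. \quad ∎$$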


Definition 5.4-2 The covariance matrix $\mathbf{K}$ associated with a real random vector $\mathbf{X}$ is the expected value of the outer vector product $(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T$, that is,‡

$$\mathbf{K} \triangleq E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]. \tag{5.4-2}$$

We have for the (i, j)th component

$$K_{ij} \triangleq E[(X_i - \mu_i)(X_j - \mu_j)] = E[(X_j - \mu_j)(X_i - \mu_i)] = K_{ji}, \quad i, j = 1, \ldots, n. \tag{5.4-3}$$

†This section requires some familiarity with matrix theory.
‡We temporarily dispense with adding identifying subscripts on the mean, covariance and other vector parameters since it is clear we are dealing only with the RV X.


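In particular with $\sigma_i^2 \triangleq K_{ii}$, we can write $\mathbf{K}$ in expanded form as

$$\mathbf{K} = \begin{bmatrix} \sigma_1^2 & \cdots & K_{1n} \\ \vdots & \sigma_i^2 & \vdots \\ K_{n1} & \cdots & \sigma_n^2 \end{bmatrix}. \tag{5.4-4}$$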


If X is real, all the elements of K are real. Also since $K_{ij} = K_{ji}$, real-valued covariance matrices fall within the class of matrices called real symmetric (r.s.). Such matrices fall within the larger class of Hermitian matrices.† Real symmetric matrices have many interesting properties, several of which we shall discuss in the next section.

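The diagonal elements $\sigma_i^2$ are the variances associated with the individual RVs $X_i$ for $i = 1, \ldots, n$. The covariance matrix $\mathbf{K}$ is closely related to the correlation matrix $\mathbf{R}$ defined by

$$\mathbf{R} \triangleq E[\mathbf{X}\mathbf{X}^T]. \tag{5.4-5}$$

Indeed expanding Equation 5.4-2 yields

$$\mathbf{K} = \mathbf{R} - \boldsymbol{\mu}\boldsymbol{\mu}^T$$

or

$$\mathbf{R} = \mathbf{K} + \boldsymbol{\mu}\boldsymbol{\mu}^T. \tag{5.4-6}$$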
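Equation 5.4-6 also holds for sample averages when the same 1/N normalization is used throughout. The sketch below illustrates this on synthetic two-component data; the data model, sample size, and helper names are assumptions chosen only for the demonstration.

```python
import random

def outer(a, b):
    """Outer product a b^T of two lists."""
    return [[ai * bj for bj in b] for ai in a]

def average(mats):
    """Element-wise average of a list of equally sized matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[r][c] for m in mats) / n for c in range(cols)]
            for r in range(rows)]

rng = random.Random(0)
samples = []
for _ in range(5000):
    a = rng.gauss(1.0, 1.0)
    b = 0.5 * a + rng.gauss(-2.0, 1.0)  # correlated with a by construction
    samples.append([a, b])

N = len(samples)
mu = [sum(x[i] for x in samples) / N for i in range(2)]
R = average([outer(x, x) for x in samples])  # sample version of E[X X^T]
centered = [[x[0] - mu[0], x[1] - mu[1]] for x in samples]
K = average([outer(c, c) for c in centered])  # sample covariance (1/N)
mu_mu_T = outer(mu, mu)

residual = max(abs(R[i][j] - K[i][j] - mu_mu_T[i][j])
               for i in range(2) for j in range(2))
print(residual)  # close to zero, up to floating-point rounding
```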


The correlation matrix R is also real symmetric for a real-valued random vector and is 
sometimes called the autocorrelation matrix. Random vectors are often classified according 
to whether they are uncorrelated, orthogonal, or independent. 


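Definition 5.4-3 Consider two real n-dimensional random vectors $\mathbf{X}$ and $\mathbf{Y}$ with respective mean vectors $\boldsymbol{\mu}_X$ and $\boldsymbol{\mu}_Y$. Then if the expected value of their outer product satisfies

$$E\{\mathbf{X}\mathbf{Y}^T\} = \boldsymbol{\mu}_X\boldsymbol{\mu}_Y^T, \tag{5.4-7}$$

X and Y are said to be uncorrelated. If

$$E\{\mathbf{X}\mathbf{Y}^T\} = \mathbf{0} \quad (\text{an } n \times n \text{ matrix of all zeros}), \tag{5.4-8}$$

X and Y are said to be orthogonal.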


Note that in the orthogonal case $E\{X_iY_j\} = 0$ for all $1 \le i, j \le n$. Thus, the expected value of the inner product is zero, that is, $E[\mathbf{X}^T\mathbf{Y}] = 0$, which reminds us of the meaning of orthogonality for two ordinary (nonrandom) vectors, that is, $\mathbf{x}^T\mathbf{y} = 0$.

Finally if

$$f_{\mathbf{XY}}(\mathbf{x}, \mathbf{y}) = f_{\mathbf{X}}(\mathbf{x})f_{\mathbf{Y}}(\mathbf{y}), \tag{5.4-9}$$

X and Y are said to be independent. ∎

†The class of n × n matrices for which $K_{ij} = K_{ji}^*$. For a thorough discussion of the properties of such matrices see [5-2]. When X is complex, the covariance is generally not r.s. but is Hermitian.


Independence always implies uncorrelatedness but the converse is not generally true. An 
exception is the multidimensional Gaussian pdf to be presented in Section 5.6. It is often 
difficult, in practice, to show that two random vectors are independent. However, statistical 
tests exist to determine, within prescribed confidence levels, the extent to which they are 
correlated. 


Example 5.4-1 
(almost independent RVs) Consider two RVs $X_1$ and $X_2$ with joint pdf $f_{X_1X_2}(x_1, x_2) = x_1 + x_2$ for $0 < x_1 \le 1$, $0 < x_2 \le 1$, and zero elsewhere. We find that while $X_1$ and $X_2$ are not independent, they are essentially uncorrelated. To demonstrate this, we shall compute $E[(X_1 - \mu_1)(X_2 - \mu_2)]$ as

$$K_{12} = K_{21} = R_{12} - \mu_1\mu_2.$$

We first compute

$$\mu_1 = \mu_2 = \iint_S x(x + y)\,dx\,dy = 0.583,$$

where $S = \{(x_1, x_2): 0 < x_1 \le 1,\ 0 < x_2 \le 1\}$.
Next we compute the correlation products

$$R_{12} = R_{21} \triangleq \iint_S xy(x + y)\,dx\,dy = 0.333.$$

Hence $K_{12} = K_{21} = 0.333 - (0.583)^2 = -0.007$. Also we compute

$$\sigma_1^2 = \sigma_2^2 = \int_0^1 x^2\left(x + \tfrac{1}{2}\right)dx - (0.583)^2 = 0.4167 - 0.34 = 0.077.$$

Hence the correlation coefficient (normalized covariance) is computed to be $\rho = K_{12}/\sigma_1\sigma_2 = -0.091$. For the purpose of predicting $X_2$ by observing $X_1$, or vice versa, one may consider these RVs as being uncorrelated. Indeed the prediction error $\varepsilon$ in Equation 4.3-22 from Example 4.3-4 is 0.076. Were $X_1$, $X_2$ truly uncorrelated, the prediction error would have been 0.077. The covariance matrix $\mathbf{K}$ for this case is

$$\mathbf{K} = \begin{bmatrix} 0.077 & -0.007 \\ -0.007 & 0.077 \end{bmatrix} = 0.077\begin{bmatrix} 1 & -0.09 \\ -0.09 & 1 \end{bmatrix}.$$
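The numbers in Example 5.4-1 can be reproduced exactly with rational arithmetic. The sketch below uses the closed form $E[X_1^aX_2^b] = \frac{1}{(a+2)(b+1)} + \frac{1}{(a+1)(b+2)}$, obtained by integrating $x_1^ax_2^b(x_1 + x_2)$ over the unit square; the function name is an illustrative choice. The exact variance is 11/144, about 0.0764, so the value 0.077 above reflects intermediate rounding.

```python
from fractions import Fraction

def moment(a, b):
    """E[X1^a X2^b] for the pdf f(x1, x2) = x1 + x2 on the unit square."""
    return Fraction(1, (a + 2) * (b + 1)) + Fraction(1, (a + 1) * (b + 2))

mu = moment(1, 0)                  # mean of X1 (and of X2 by symmetry)
r12 = moment(1, 1)                 # correlation E[X1 X2]
k12 = r12 - mu * mu                # covariance K12
var = moment(2, 0) - mu * mu       # variance of X1 (and of X2)
rho = k12 / var                    # correlation coefficient, since sigma1 = sigma2

print(mu, r12, k12, var, rho)
print(float(mu), float(k12), float(var), float(rho))
```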
5.5 PROPERTIES OF COVARIANCE MATRICES


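Since covariance matrices are r.s., we study some of the properties of such matrices. Let $\mathbf{M}$ be any n × n r.s. matrix. The quadratic form associated with $\mathbf{M}$ is the scalar $q(\mathbf{z})$ defined by

$$q(\mathbf{z}) \triangleq \mathbf{z}^T\mathbf{M}\mathbf{z}, \tag{5.5-1}$$

where $\mathbf{z}$ is any column vector. A matrix $\mathbf{M}$ is said to be positive semidefinite (p.s.d.) if

$$\mathbf{z}^T\mathbf{M}\mathbf{z} \ge 0$$

for all $\mathbf{z}$. If the inequality is strict, i.e. $\mathbf{z}^T\mathbf{M}\mathbf{z} > 0$ for all $\mathbf{z} \ne \mathbf{0}$, $\mathbf{M}$ is said to be positive definite (p.d.). A covariance matrix $\mathbf{K}$ is always (at least) p.s.d. since for any vector $\mathbf{z} \triangleq (z_1, \ldots, z_n)^T$

$$0 \le E\{[\mathbf{z}^T(\mathbf{X} - \boldsymbol{\mu})]^2\} = \mathbf{z}^TE[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]\mathbf{z} = \mathbf{z}^T\mathbf{K}\mathbf{z}. \tag{5.5-2}$$

We shall show later that when $\mathbf{K}$ is full-rank, then $\mathbf{K}$ is p.d.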


We now state some definitions and theorems (most without proof) from linear algebra 
[5-2, Chapter 4] that we shall need for developing useful operations on covariance matrices. 


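Definition 5.5-1 The eigenvalues of an n × n matrix $\mathbf{M}$ are those numbers $\lambda$ for which the characteristic equation $\mathbf{M}\boldsymbol{\phi} = \lambda\boldsymbol{\phi}$ has a solution $\boldsymbol{\phi} \ne \mathbf{0}$. The column vector $\boldsymbol{\phi} = (\phi_1, \phi_2, \ldots, \phi_n)^T$ is called an eigenvector.

Eigenvectors are often normalized so that $\boldsymbol{\phi}^T\boldsymbol{\phi} \triangleq \|\boldsymbol{\phi}\|^2 = 1$. ∎

Theorem 5.5-1 The number $\lambda$ is an eigenvalue of the square matrix $\mathbf{M}$ if and only if $\det(\mathbf{M} - \lambda\mathbf{I}) = 0$.† ∎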


Example 5.5-1 
(eigenvalues) Consider the matrix 


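$$\mathbf{M} = \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}.$$

The eigenvalues are obtained with the help of Theorem 5.5-1, that is,

$$\det\begin{bmatrix} 4 - \lambda & 2 \\ 2 & 4 - \lambda \end{bmatrix} = (4 - \lambda)^2 - 4 = 0,$$

whence $\lambda_1 = 6$, $\lambda_2 = 2$.

The (normalized) eigenvector associated with $\lambda_1 = 6$ is obtained from

$$(\mathbf{M} - 6\mathbf{I})\boldsymbol{\phi} = \mathbf{0},$$

which, written out as a system of equations, yields

$$\left.\begin{aligned} -2\phi_1 + 2\phi_2 &= 0 \\ 2\phi_1 - 2\phi_2 &= 0 \end{aligned}\right\} \Rightarrow \boldsymbol{\phi}_1 = \frac{1}{\sqrt{2}}(1, 1)^T.$$

The double arrow $\Rightarrow$ means "implies that." The eigenvector associated with $\lambda_2 = 2$, following the same procedure as above, is found from

$$\left.\begin{aligned} 2\phi_1 + 2\phi_2 &= 0 \\ 2\phi_1 + 2\phi_2 &= 0 \end{aligned}\right\} \Rightarrow \boldsymbol{\phi}_2 = \frac{1}{\sqrt{2}}(1, -1)^T.$$

†det is short for determinant and I is the identity matrix.

Not all n × n matrices have n distinct eigenvalues or n eigenvectors. Sometimes a matrix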


Not all n x n matrices have n distinct eigenvalues or n eigenvectors. Sometimes a matrix 
can have fewer than n distinct eigenvalues but still have n distinct eigenvectors. 
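For a 2 × 2 real symmetric matrix the eigenvalues and unit eigenvectors can be written in closed form, which gives a quick check of Example 5.5-1. The sketch below is one possible way to code this; the function name and the handling of the diagonal case are assumptions made for the illustration.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of the r.s. matrix [[a, b], [b, d]]."""
    if b == 0:
        # Already diagonal: the coordinate axes are eigenvectors.
        return [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
    center = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (center + radius, center - radius):
        # First row of (M - lam I) phi = 0 gives phi proportional to (b, lam - a).
        vec = (b, lam - a)
        norm = math.hypot(vec[0], vec[1])
        pairs.append((lam, (vec[0] / norm, vec[1] / norm)))
    return pairs

for lam, phi in eig_sym_2x2(4.0, 2.0, 4.0):
    print(lam, phi)
```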





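Definition 5.5-2 Two n × n matrices $\mathbf{A}$ and $\mathbf{B}$ are called similar if there exists an n × n invertible matrix $\mathbf{T}$, i.e. $\det \mathbf{T} \ne 0$, such that

$$\mathbf{T}^{-1}\mathbf{A}\mathbf{T} = \mathbf{B}. \quad ∎ \tag{5.5-3}$$

Theorem 5.5-2 An n × n matrix $\mathbf{M}$ is similar to a diagonal matrix if and only if $\mathbf{M}$ has n linearly independent eigenvectors. ∎

Theorem 5.5-3 Let $\mathbf{M}$ be an r.s. matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$. Then $\mathbf{M}$ has n mutually orthogonal unit eigenvectors $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n$.† ∎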


Discussion. Since $\mathbf{M}$ has n mutually orthogonal (and therefore independent) unit eigenvectors, it is similar to some diagonal matrix $\boldsymbol{\Lambda}$ under a suitable transformation $\mathbf{T}$. What are $\boldsymbol{\Lambda}$ and $\mathbf{T}$? The answer is furnished by the following important theorem.


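Theorem 5.5-4 Let $\mathbf{M}$ be a real symmetric matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$. Then $\mathbf{M}$ is similar to the diagonal matrix $\boldsymbol{\Lambda}$ given by

$$\boldsymbol{\Lambda} \triangleq \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$$

under the transformation

$$\mathbf{U}^{-1}\mathbf{M}\mathbf{U} = \boldsymbol{\Lambda}, \tag{5.5-4}$$

where $\mathbf{U}$ is a matrix whose columns are the corresponding‡ orthogonal unit eigenvectors $\boldsymbol{\phi}_i$, $i = 1, \ldots, n$, of $\mathbf{M}$. Thus,

$$\mathbf{U} = (\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n). \tag{5.5-5}$$

Moreover, it can be shown that $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ (and that $\mathbf{U}^T = \mathbf{U}^{-1}$) so that Equation 5.5-4 can be written as

$$\mathbf{U}^T\mathbf{M}\mathbf{U} = \boldsymbol{\Lambda}. \quad ∎ \tag{5.5-6}$$

†Orthogonal eigenvectors $\boldsymbol{\phi}_i$ such that $\|\boldsymbol{\phi}_i\| = 1$ are said to be orthonormal.
‡That is, $\boldsymbol{\phi}_i$ goes with $\lambda_i$ for $i = 1, \ldots, n$.

Discussion. Matrices such as $\mathbf{U}$, which satisfy $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, are called unitary. They have the property of distance preservation in the following sense: Consider a vector $\mathbf{x} = (x_1, \ldots, x_n)^T$. The Euclidean distance of $\mathbf{x}$ from the origin is

$$\|\mathbf{x}\| \triangleq (\mathbf{x}^T\mathbf{x})^{1/2},$$

where $\|\mathbf{x}\|$ is called the norm of $\mathbf{x}$. Now consider the transformation $\mathbf{y} = \mathbf{U}\mathbf{x}$, where $\mathbf{U}$ is unitary. Then

$$\|\mathbf{y}\|^2 = \mathbf{y}^T\mathbf{y} = \mathbf{x}^T\mathbf{U}^T\mathbf{U}\mathbf{x} = \|\mathbf{x}\|^2.$$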


Thus, the new vector y has the same distance from the origin as the old vector x under the 
transformation y = Ux. 

Since a covariance matrix K of a real random vector is real symmetric, it can be readily 
diagonalized according to Equation 5.5-6 once U is known. The columns of U are just the 
normalized eigenvectors of K and these can be obtained once the eigenvalues are known. The 
diagonalization of covariance matrices is a very important procedure in applied probability 
theory. It is used to transform correlated RVs into uncorrelated RVs and, in the Normal 
case, it transforms correlated RVs into independent RVs. 


Example 5.5-2
(decorrelation of random vectors) A random vector $\mathbf{X} = (X_1, X_2, X_3)^T$ has covariance matrix†

$$\mathbf{K}_{XX} = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 2 & 0 \\ 1 & 0 & 2 \end{bmatrix}.$$


Design an invertible linear transformation that will generate from X a new random vector Y 
whose components are uncorrelated. 


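Solution First we compute the eigenvalues by solving the equation $\det(\mathbf{K}_{XX} - \lambda\mathbf{I}) = 0$. This yields $\lambda_1 = 2$, $\lambda_2 = 2 + \sqrt{2}$, $\lambda_3 = 2 - \sqrt{2}$. Next we compute the three orthogonal eigenvectors by solving the equation $(\mathbf{K}_{XX} - \lambda_i\mathbf{I})\boldsymbol{\phi}_i = \mathbf{0}$, $i = 1, 2, 3$ and normalize these to create eigenvectors of unit norm. Unit normalization is achieved by dividing each component of the eigenvector by the norm of the eigenvector. This yields

$$\boldsymbol{\phi}_1 = \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)^T, \quad \boldsymbol{\phi}_2 = \left(\tfrac{1}{\sqrt{2}}, -\tfrac{1}{2}, \tfrac{1}{2}\right)^T, \quad \boldsymbol{\phi}_3 = \left(\tfrac{1}{\sqrt{2}}, \tfrac{1}{2}, -\tfrac{1}{2}\right)^T.$$

†Here we add subscripts to K to help distinguish the covariance matrix of one random variable from that of another.

Now we create the eigenvector matrix $\mathbf{U} = [\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2\ \boldsymbol{\phi}_3]$ that, upon transposing, becomes an appropriate transformer to make the components of $\mathbf{Y}$ uncorrelated. With

$$\mathbf{A} = \mathbf{U}^T = \begin{bmatrix} 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{2} & \frac{1}{2} \\ \frac{1}{\sqrt{2}} & \frac{1}{2} & -\frac{1}{2} \end{bmatrix},$$

the transformation $\mathbf{Y} = \mathbf{A}\mathbf{X}$ yields the components

$$Y_1 = \frac{1}{\sqrt{2}}(X_2 + X_3)$$
$$Y_2 = \frac{1}{\sqrt{2}}X_1 - \frac{1}{2}X_2 + \frac{1}{2}X_3$$
$$Y_3 = \frac{1}{\sqrt{2}}X_1 + \frac{1}{2}X_2 - \frac{1}{2}X_3.$$

The covariance of $\mathbf{Y}$ is given by

$$\mathbf{K}_{YY} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 + \sqrt{2} & 0 \\ 0 & 0 & 2 - \sqrt{2} \end{bmatrix}.$$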

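Actually we could go one step further; by scaling the three components of $\mathbf{Y}$, separately, we can make the variance (average AC power) the same in each scaled component. This process is called whitening and is discussed in greater detail below. Clearly if $Y_1$ is scaled proportional to $1/\sqrt{2}$, $Y_2$ is scaled proportional to $1/\sqrt{2 + \sqrt{2}}$, and $Y_3$ is scaled proportional to $1/\sqrt{2 - \sqrt{2}}$, all three outputs will have the same power.

If $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n$ are the orthogonal unit eigenvectors of a real symmetric matrix $\mathbf{M}$, then the system of equations

$$\mathbf{M}\boldsymbol{\phi}_1 = \lambda_1\boldsymbol{\phi}_1$$
$$\vdots$$
$$\mathbf{M}\boldsymbol{\phi}_n = \lambda_n\boldsymbol{\phi}_n$$

can be compactly written as

$$\mathbf{M}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda}. \tag{5.5-7}$$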


The next theorem establishes a relation between the eigenvalues of an r.s. matrix and its 
positive definite character. 
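Theorem 5.5-5 A real symmetric matrix $\mathbf{M}$ is positive definite if and only if all its eigenvalues are positive.

Proof First let $\lambda_i > 0$, $i = 1, \ldots, n$. Then with the linear transformation $\mathbf{x} \triangleq \mathbf{U}\mathbf{y}$ we can write for any vector $\mathbf{x}$

$$\mathbf{x}^T\mathbf{M}\mathbf{x} = (\mathbf{U}\mathbf{y})^T\mathbf{M}(\mathbf{U}\mathbf{y}) = \mathbf{y}^T\mathbf{U}^T\mathbf{M}\mathbf{U}\mathbf{y} = \mathbf{y}^T\boldsymbol{\Lambda}\mathbf{y} = \sum_{i=1}^{n}\lambda_iy_i^2 > 0 \tag{5.5-8}$$

unless $\mathbf{y} = \mathbf{0}$. But if $\mathbf{y} = \mathbf{0}$, then from $\mathbf{x} = \mathbf{U}\mathbf{y}$, $\mathbf{x} = \mathbf{0}$ as well. Hence we have shown that $\mathbf{M}$ is p.d. if $\lambda_i > 0$ for all $i$. Conversely, we must show that if $\mathbf{M}$ is p.d., then all $\lambda_i > 0$. Thus, for any $\mathbf{x} \ne \mathbf{0}$

$$0 < \mathbf{x}^T\mathbf{M}\mathbf{x}. \tag{5.5-9}$$

In particular, Equation 5.5-9 must hold for $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n$. But

$$0 < \boldsymbol{\phi}_i^T\mathbf{M}\boldsymbol{\phi}_i = \lambda_i, \quad i = 1, \ldots, n.$$

Hence $\lambda_i > 0$, $i = 1, \ldots, n$. Thus, a p.d. covariance matrix $\mathbf{K}$ will have all positive eigenvalues. Also since its determinant $\det(\mathbf{K})$ is the product of its eigenvalues, $\det(\mathbf{K}) > 0$. Thus when $\mathbf{K}$ is full-rank, it is p.d. ∎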


Whitening Transformation 


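We are given a zero-mean n × 1 random vector $\mathbf{X}$ with positive definite covariance matrix $\mathbf{K}_{XX}$ and wish to find a transformation $\mathbf{Y} = \mathbf{C}\mathbf{X}$ such that $\mathbf{K}_{YY} = \mathbf{I}$. The matrix $\mathbf{C}$ is called a whitening transform and the process of going from $\mathbf{X}$ to $\mathbf{Y}$ is called a whitening transformation. Let the n unit eigenvectors and eigenvalues of $\mathbf{K}_{XX}$ be denoted, respectively, by $\boldsymbol{\phi}_i, \lambda_i$, $i = 1, \ldots, n$. Then the characteristic equation $\mathbf{K}_{XX}\boldsymbol{\phi}_i = \lambda_i\boldsymbol{\phi}_i$, $i = 1, \ldots, n$ can be compactly written as $\mathbf{K}_{XX}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda}$, where $\mathbf{U} \triangleq [\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2 \cdots \boldsymbol{\phi}_n]$ and $\boldsymbol{\Lambda} \triangleq \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Since $\mathbf{K}_{XX}$ is p.d., all its eigenvalues are positive and the matrix $\boldsymbol{\Lambda}^{-1/2} \triangleq \mathrm{diag}(1/\sqrt{\lambda_1}, 1/\sqrt{\lambda_2}, \ldots, 1/\sqrt{\lambda_n})$ exists and is well defined. Now consider the transformation $\mathbf{Y} = \mathbf{C}\mathbf{X} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T\mathbf{X}$. Then

$$\begin{aligned} \mathbf{K}_{YY} = E[\mathbf{Y}\mathbf{Y}^T] &= E[\mathbf{C}\mathbf{X}\mathbf{X}^T\mathbf{C}^T] = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^TE[\mathbf{X}\mathbf{X}^T]\mathbf{U}\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T\mathbf{K}_{XX}\mathbf{U}\boldsymbol{\Lambda}^{-1/2} \\ &= \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T(\mathbf{K}_{XX}\mathbf{U})\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T(\mathbf{U}\boldsymbol{\Lambda})\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}(\mathbf{U}^T\mathbf{U})\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1/2} \\ &= \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1/2} = \mathbf{I}, \quad \text{since } \mathbf{U}^T\mathbf{U} = \mathbf{I}. \end{aligned}$$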


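Example 5.5-3
(whitening transformation) In Example 5.5-2 we considered the random vector $\mathbf{X}$ with covariance matrix and eigenvector matrix, respectively,

$$\mathbf{K}_{XX} = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 2 & 0 \\ 1 & 0 & 2 \end{bmatrix}, \quad \mathbf{U} = \begin{bmatrix} 0 & 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/2 & 1/2 \\ 1/\sqrt{2} & 1/2 & -1/2 \end{bmatrix} = \mathbf{U}^T$$

with

$$\boldsymbol{\Lambda} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 + \sqrt{2} & 0 \\ 0 & 0 & 2 - \sqrt{2} \end{bmatrix}.$$

Then

$$\mathbf{Y} = \begin{bmatrix} 1/\sqrt{2} & 0 & 0 \\ 0 & (2 + \sqrt{2})^{-1/2} & 0 \\ 0 & 0 & (2 - \sqrt{2})^{-1/2} \end{bmatrix}\begin{bmatrix} 0 & 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/2 & 1/2 \\ 1/\sqrt{2} & 1/2 & -1/2 \end{bmatrix}\mathbf{X}$$

is the appropriate whitening transformation. Whitening transformations are especially useful in the simultaneous diagonalization of two covariance matrices.†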
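The whitening claim can be tested by building $\mathbf{C} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$ for the matrices of this example and checking that $\mathbf{C}\mathbf{K}_{XX}\mathbf{C}^T$ is the identity. The sketch below uses illustrative helper names and plain lists rather than a matrix library.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

s = 1.0 / math.sqrt(2.0)
K_xx = [[2.0, -1.0, 1.0], [-1.0, 2.0, 0.0], [1.0, 0.0, 2.0]]
U = [[0.0, s, s], [s, -0.5, 0.5], [s, 0.5, -0.5]]
eigenvalues = [2.0, 2.0 + math.sqrt(2.0), 2.0 - math.sqrt(2.0)]

# Lambda^{-1/2} is diagonal, so build it entry by entry.
lam_inv_sqrt = [[1.0 / math.sqrt(eigenvalues[i]) if i == j else 0.0
                 for j in range(3)] for i in range(3)]
C = matmul(lam_inv_sqrt, transpose(U))       # whitening transform
K_yy = matmul(matmul(C, K_xx), transpose(C))  # should be the identity

for row in K_yy:
    print([round(val, 6) for val in row])
```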


5.6 THE MULTIDIMENSIONAL GAUSSIAN (NORMAL) LAW 


The general n-dimensional Gaussian law has a rather forbidding mathematical appearance 
upon first acquaintance but is, fortunately, rather easily seen as an extension of the one- 
dimensional Gaussian pdf. Indeed we already introduced the two-dimensional Gaussian pdf 
in Section 4.3 but there we did not infer it from the general case. Here we consider the 
general case from which we shall be able to infer all special cases. We already know that if 
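$X$ is a (scalar) Gaussian RV with mean $\mu$ and variance $\sigma^2$, its pdf is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right).$$

First, we consider a random vector $\mathbf{X} = (X_1, \ldots, X_n)^T$ with independent components $X_i$, $i = 1, \ldots, n$, each distributed as $N(\mu_i, \sigma_i^2)$. Then the pdf of $\mathbf{X}$ is the product of the individual pdf's of $X_1, \ldots, X_n$, that is,

$$f_{\mathbf{X}}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) = \frac{1}{(2\pi)^{n/2}\sigma_1\cdots\sigma_n}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right], \tag{5.6-1}$$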


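where $\mu_i$, $\sigma_i^2$ are the mean and variance, respectively, of $X_i$, $i = 1, \ldots, n$. Equation 5.6-1 can be written compactly as

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}}\exp\left[-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right], \tag{5.6-2}$$

where

$$\mathbf{K}_{XX} \triangleq \begin{bmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{bmatrix}, \tag{5.6-3}$$

$\boldsymbol{\mu} = (\mu_1, \ldots, \mu_n)^T$, and $\det(\mathbf{K}_{XX}) = \prod_{i=1}^{n}\sigma_i^2$. Note that $\mathbf{K}_{XX}^{-1}$ is merely

$$\mathbf{K}_{XX}^{-1} = \begin{bmatrix} \sigma_1^{-2} & & 0 \\ & \ddots & \\ 0 & & \sigma_n^{-2} \end{bmatrix}.$$

Note that because the $X_i$, $i = 1, \ldots, n$ are independent, the covariance matrix $\mathbf{K}_{XX}$ is diagonal, since

$$E[(X_i - \mu_i)^2] \triangleq \sigma_i^2, \quad i = 1, \ldots, n \tag{5.6-4}$$
$$E[(X_i - \mu_i)(X_j - \mu_j)] = 0, \quad i \ne j. \tag{5.6-5}$$

†Such diagonalizations occur in a branch of applied probability called pattern recognition. In particular, if one is trying to distinguish between two classes of data, it is easier to do so when the data are represented by diagonal covariance matrices.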

Next we ask, what happens if $\mathbf{K}_{XX}$ is a positive definite covariance matrix that is not necessarily diagonal? Does Equation 5.6-2 with arbitrary p.d. covariance $\mathbf{K}_{XX}$ still obey the requirements of a pdf? If it does, we shall call $\mathbf{X}$ a Normal random vector and $f_{\mathbf{X}}(\mathbf{x})$ the multidimensional Normal pdf. To show that $f_{\mathbf{X}}(\mathbf{x})$ is indeed a pdf, we must show that

$$f_{\mathbf{X}}(\mathbf{x}) \ge 0 \tag{5.6-6a}$$

and

$$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x} = 1. \tag{5.6-6b}$$

(We use the vector notation $d\mathbf{x} \triangleq dx_1\,dx_2 \ldots dx_n$ for a volume element.) We assume as always that $\mathbf{X}$ is real; that is, $X_1, \ldots, X_n$ are real RVs. To show that Equation 5.6-2 with arbitrary p.d. covariance matrix $\mathbf{K}_{XX}$ satisfies Equation 5.6-6a is simple and left as an exercise; to prove Equation 5.6-6b is more difficult, and follows here.


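Proof of Equation 5.6-6b when $f_{\mathbf{X}}(\mathbf{x})$ is as in Equation 5.6-2 and $\mathbf{K}_{XX}$ is an arbitrary p.d. covariance matrix. We note that with $\mathbf{z} \triangleq \mathbf{x} - \boldsymbol{\mu}$, Equation 5.6-2 can be written as

$$f_{\mathbf{X}}(\mathbf{x}) \triangleq \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}}\,\phi(\mathbf{z}),$$

where

$$\phi(\mathbf{z}) \triangleq \exp\left(-\tfrac{1}{2}\mathbf{z}^T\mathbf{K}_{XX}^{-1}\mathbf{z}\right). \tag{5.6-7a}$$

With

$$\alpha \triangleq \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\phi(\mathbf{z})\,d\mathbf{z}, \tag{5.6-7b}$$

we see that

$$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x} = \frac{\alpha}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}}.$$

Hence we need only evaluate $\alpha$ to prove (or disprove) Equation 5.6-6b.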

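From the discussion on whitening transformation we know that there exists an n × n matrix $\mathbf{C}$ such that $\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T$ and $\mathbf{C}^T\mathbf{K}_{XX}^{-1}\mathbf{C} = \mathbf{I}$ (the identity matrix). Now consider the linear transformation

$$\mathbf{z} = \mathbf{C}\mathbf{y} \tag{5.6-8}$$

for use in Equation 5.6-7a. To understand the effect of this transformation, we note first that

$$\mathbf{z}^T\mathbf{K}_{XX}^{-1}\mathbf{z} = \mathbf{y}^T\mathbf{C}^T\mathbf{K}_{XX}^{-1}\mathbf{C}\mathbf{y} = \|\mathbf{y}\|^2 = \sum_{i=1}^{n} y_i^2$$

so that $\phi(\mathbf{z})$ is given by

$$\phi(\mathbf{z}) = \prod_{i=1}^{n}\exp\left[-\tfrac{1}{2}y_i^2\right].$$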


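Next we use a result from advanced calculus (see Kenneth Miller, [5-5, p. 16]) that for a linear transformation such as in Equation 5.6-8 volume elements are related as

$$d\mathbf{z} = |\det(\mathbf{C})|\,d\mathbf{y},$$

where $d\mathbf{z} \triangleq dz_1 \ldots dz_n$ and $d\mathbf{y} = dy_1 \ldots dy_n$. Hence Equation 5.6-7b is transformed to

$$\begin{aligned} \alpha &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\exp\left(-\frac{1}{2}\sum_{i=1}^{n} y_i^2\right)dy_1 \ldots dy_n\,|\det(\mathbf{C})| \\ &= \prod_{i=1}^{n}\left[\int_{-\infty}^{\infty}\exp\left(-\tfrac{1}{2}y_i^2\right)dy_i\right]|\det(\mathbf{C})| \\ &= [2\pi]^{n/2}\,|\det(\mathbf{C})|. \end{aligned}$$

But since $\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T$, $\det(\mathbf{K}_{XX}) = \det(\mathbf{C})\det(\mathbf{C}^T) = [\det(\mathbf{C})]^2$ or

$$|\det(\mathbf{C})| = |\det(\mathbf{K}_{XX})|^{1/2} = (\det(\mathbf{K}_{XX}))^{1/2}.$$

Hence

$$\alpha = (2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}$$

and

$$\frac{\alpha}{[2\pi]^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}} = 1,$$

which proves Equation 5.6-6b. ∎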
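A crude numerical integration gives a sanity check of Equation 5.6-6b in two dimensions. The sketch below assumes a zero-mean vector and an illustrative non-diagonal p.d. covariance matrix with entries 3, −1, −1, 3; the grid width and step count are arbitrary choices that cover about six standard deviations in every direction.

```python
import math

def gaussian_pdf_2d(x1, x2, k11, k12, k22):
    """Zero-mean bivariate Normal pdf (Equation 5.6-2 with n = 2)."""
    det = k11 * k22 - k12 * k12
    # x^T K^{-1} x, using the closed-form inverse of a 2 x 2 matrix.
    quad = (k22 * x1 * x1 - 2.0 * k12 * x1 * x2 + k11 * x2 * x2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def integrate_pdf(k11, k12, k22, half_width=12.0, steps=240):
    """Midpoint-rule integral of the pdf over a square centered at the origin."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x1 = -half_width + (i + 0.5) * h
        for j in range(steps):
            x2 = -half_width + (j + 0.5) * h
            total += gaussian_pdf_2d(x1, x2, k11, k12, k22)
    return total * h * h

total_probability = integrate_pdf(3.0, -1.0, 3.0)
print(total_probability)
```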
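Having established that

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}}\exp\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right) \tag{5.6-9}$$

indeed satisfies the requirements of a pdf and is a generalization of the univariate Normal pdf, we now ask what is the pdf of the random vector $\mathbf{Y}$ given by

$$\mathbf{Y} \triangleq \mathbf{A}\mathbf{X}, \tag{5.6-10}$$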


where A is a nonsingular n x n transformation. The answer is furnished by the following 
theorem. 


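Theorem 5.6-1 Let $\mathbf{X}$ be an n-dimensional Normal random vector with positive definite covariance matrix $\mathbf{K}_{XX}$ and mean vector $\boldsymbol{\mu}$. Let $\mathbf{A}$ be a nonsingular linear transformation in n dimensions. Then $\mathbf{Y} \triangleq \mathbf{A}\mathbf{X}$ is an n-dimensional Normal random vector with covariance matrix $\mathbf{K}_{YY} \triangleq \mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T$ and mean vector $\boldsymbol{\beta} \triangleq \mathbf{A}\boldsymbol{\mu}$. ∎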


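Proof We use Equation 5.2-11, that is,

$$f_{\mathbf{Y}}(\mathbf{y}) = \sum_{i=1}^{r}\frac{f_{\mathbf{X}}(\mathbf{x}_i)}{|J_i|}, \tag{5.6-11}$$

where $\mathbf{Y}$ is some function of $\mathbf{X}$, that is, $\mathbf{Y} = \mathbf{g}(\mathbf{X}) \triangleq (g_1(\mathbf{X}), \ldots, g_n(\mathbf{X}))^T$, the $\mathbf{x}_i$, $i = 1, \ldots, r$, are the roots of the equation $\mathbf{g}(\mathbf{x}_i) - \mathbf{y} = \mathbf{0}$, and $J_i$ is the Jacobian evaluated at the ith root, that is,

$$J_i = \det\left(\frac{\partial\mathbf{g}}{\partial\mathbf{x}}\right)\bigg|_{\mathbf{x} = \mathbf{x}_i} = \begin{vmatrix} \dfrac{\partial g_1}{\partial x_1} & \cdots & \dfrac{\partial g_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial g_n}{\partial x_1} & \cdots & \dfrac{\partial g_n}{\partial x_n} \end{vmatrix}_{\mathbf{x} = \mathbf{x}_i}. \tag{5.6-12}$$

Since we are dealing with a nonsingular linear transformation, the only solution to

$$\mathbf{A}\mathbf{x} - \mathbf{y} = \mathbf{0} \quad \text{is} \quad \mathbf{x} = \mathbf{A}^{-1}\mathbf{y}. \tag{5.6-13}$$

Also

$$J_i = \det\left(\frac{\partial(\mathbf{A}\mathbf{x})}{\partial\mathbf{x}}\right) = \det(\mathbf{A}). \tag{5.6-14}$$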
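Hence

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{XX})]^{1/2}|\det(\mathbf{A})|}\exp\left(-\tfrac{1}{2}(\mathbf{A}^{-1}\mathbf{y} - \boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{A}^{-1}\mathbf{y} - \boldsymbol{\mu})\right). \tag{5.6-15}$$

Can this formidable expression be put in the form of Equation 5.6-9? First we note that

$$[\det(\mathbf{K}_{XX})]^{1/2}|\det(\mathbf{A})| = [\det(\mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T)]^{1/2}. \tag{5.6-16}$$

Next, factoring $\mathbf{A}$ inverse out of the first and last factors, and combining these terms with the inverse covariance matrix, we obtain

$$(\mathbf{A}^{-1}\mathbf{y} - \boldsymbol{\mu})^T\mathbf{K}_{XX}^{-1}(\mathbf{A}^{-1}\mathbf{y} - \boldsymbol{\mu}) = (\mathbf{y} - \mathbf{A}\boldsymbol{\mu})^T(\mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T)^{-1}(\mathbf{y} - \mathbf{A}\boldsymbol{\mu}). \tag{5.6-17}$$

But $\mathbf{A}\boldsymbol{\mu} \triangleq \boldsymbol{\beta} = E[\mathbf{Y}]$ and $\mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T = E[(\mathbf{Y} - \boldsymbol{\beta})(\mathbf{Y} - \boldsymbol{\beta})^T] \triangleq \mathbf{K}_{YY}$. Hence Equation 5.6-15 can be rewritten as

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{n/2}[\det(\mathbf{K}_{YY})]^{1/2}}\exp\left[-\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\beta})^T\mathbf{K}_{YY}^{-1}(\mathbf{y} - \boldsymbol{\beta})\right]. \quad ∎ \tag{5.6-18}$$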

The next question that arises quite naturally as an extension of the previous result is: 
Does Y remain a Normal random vector under more general (nontrivial) linear trans- 
formation? The answer is given by the following theorem, which is a generalization of 
Theorem 5.6-1. 


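Theorem 5.6-2 Let $\mathbf{X}$ be an n-dimensional Normal random vector with positive definite covariance matrix $\mathbf{K}_{XX}$ and mean vector $\boldsymbol{\mu}$. Let $\mathbf{A}_{mn}$ be an m × n matrix of rank m. Then the random vector generated by

$$\mathbf{Y} = \mathbf{A}_{mn}\mathbf{X}$$

has an m-dimensional Normal pdf with p.d. covariance matrix $\mathbf{K}_{YY}$ and mean vector $\boldsymbol{\beta}$ given, respectively, by

$$\mathbf{K}_{YY} \triangleq \mathbf{A}_{mn}\mathbf{K}_{XX}\mathbf{A}_{mn}^T \tag{5.6-19}$$

and

$$\boldsymbol{\beta} = \mathbf{A}_{mn}\boldsymbol{\mu}. \quad ∎ \tag{5.6-20}$$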


The proof of this theorem is quite similar to the proof of Theorem 5.6-1; it is given by Miller 
in [5-6, p. 22]. 

Some examples involving transformations of Normal random variables are given below.
Example 5.6-1 


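(transforming to independence) A zero-mean Normal random vector $\mathbf{X} = (X_1, X_2)^T$ has covariance matrix $\mathbf{K}_{XX}$ given by

$$\mathbf{K}_{XX} = \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}.$$

Find a transformation $\mathbf{Y} = \mathbf{C}\mathbf{X}$ such that $\mathbf{Y} = (Y_1, Y_2)^T$ is a Normal random vector with uncorrelated (and therefore independent) components of unity variance.

Solution Write

$$E[\mathbf{Y}\mathbf{Y}^T] = E[\mathbf{C}\mathbf{X}\mathbf{X}^T\mathbf{C}^T] = \mathbf{C}\mathbf{K}_{XX}\mathbf{C}^T = \mathbf{I}.$$

The last equality on the right follows from the requirement that the covariance of $\mathbf{Y}$, $\mathbf{K}_{YY}$, satisfies

$$\mathbf{K}_{YY} = \mathbf{I}.$$

From the previous discussion on whitening, the matrix $\mathbf{C}$ must be $\mathbf{C} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$, where $\boldsymbol{\Lambda}^{-1/2}$ is the normalizing matrix

$$\boldsymbol{\Lambda}^{-1/2} \triangleq \begin{bmatrix} \lambda_1^{-1/2} & 0 \\ 0 & \lambda_2^{-1/2} \end{bmatrix} \quad (\lambda_i,\ i = 1, 2 \text{ are eigenvalues of } \mathbf{K}_{XX})$$

and $\mathbf{U}$ is the matrix whose columns are the unit eigenvectors of $\mathbf{K}_{XX}$ (recall $\mathbf{U}^{-1} = \mathbf{U}^T$). From $\det(\mathbf{K}_{XX} - \lambda\mathbf{I}) = 0$, we find $\lambda_1 = 4$, $\lambda_2 = 2$. Hence

$$\boldsymbol{\Lambda} = \begin{bmatrix} 4 & 0 \\ 0 & 2 \end{bmatrix}, \quad \mathbf{Z} = \boldsymbol{\Lambda}^{-1/2} = \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{\sqrt{2}} \end{bmatrix}.$$

Next from

$$(\mathbf{K}_{XX} - \lambda_1\mathbf{I})\boldsymbol{\phi}_1 = \mathbf{0}, \quad \text{with } \|\boldsymbol{\phi}_1\| = 1,$$

and

$$(\mathbf{K}_{XX} - \lambda_2\mathbf{I})\boldsymbol{\phi}_2 = \mathbf{0}, \quad \text{with } \|\boldsymbol{\phi}_2\| = 1,$$

we find $\boldsymbol{\phi}_1 = (1/\sqrt{2}, -1/\sqrt{2})^T$, $\boldsymbol{\phi}_2 = (1/\sqrt{2}, 1/\sqrt{2})^T$. Thus,

$$\mathbf{U} = (\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$$

and

$$\mathbf{C} = \frac{1}{\sqrt{2}}\begin{bmatrix} \frac{1}{2} & -\frac{1}{2} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}.$$

As a check to see if $\mathbf{C}\mathbf{K}_{XX}\mathbf{C}^T$ is indeed an identity covariance matrix, we compute

$$\frac{1}{2}\begin{bmatrix} \frac{1}{2} & -\frac{1}{2} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}\begin{bmatrix} \frac{1}{2} & \frac{1}{\sqrt{2}} \\ -\frac{1}{2} & \frac{1}{\sqrt{2}} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$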
In some situations we might want to generate correlated samples of a random vector $\mathbf{X}$ whose covariance matrix $\mathbf{K}_{XX}$ is not diagonal. From Example 5.6-1 we see that the transformation

$$\mathbf{X} = \mathbf{C}^{-1}\mathbf{Y}, \tag{5.6-21}$$

where $\mathbf{C} = \mathbf{Z}\mathbf{U}^T$ produces a Normal random vector whose covariance is $\mathbf{K}_{XX}$. Thus, one
way of obtaining correlated from uncorrelated samples is to use the transformation given in 
Equation 5.6-21 on jointly independent computer-generated samples. This procedure is the 
reverse of what we did in Example 5.6-1. 
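The procedure can be sketched for the covariance matrix of Example 5.6-1. With $\mathbf{C} = \mathbf{Z}\mathbf{U}^T$ one finds $\mathbf{C}^{-1} = \mathbf{U}\boldsymbol{\Lambda}^{1/2}$, which for this example works out to rows $(\sqrt{2}, 1)$ and $(-\sqrt{2}, 1)$. The sample size, seed, and function names below are illustrative assumptions, and the sample covariance only approximates $\mathbf{K}_{XX}$.

```python
import math
import random

# C^{-1} = U Lambda^{1/2} for K_XX = [[3, -1], [-1, 3]] (Example 5.6-1).
C_INV = [[math.sqrt(2.0), 1.0],
         [-math.sqrt(2.0), 1.0]]

def correlated_sample(rng):
    """Map two independent N(0, 1) samples Y to X = C^{-1} Y (Equation 5.6-21)."""
    y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (C_INV[0][0] * y1 + C_INV[0][1] * y2,
            C_INV[1][0] * y1 + C_INV[1][1] * y2)

def sample_covariance(samples):
    """Return (K11, K12, K22) estimated with 1/N normalization."""
    n = len(samples)
    m1 = sum(x[0] for x in samples) / n
    m2 = sum(x[1] for x in samples) / n
    k11 = sum((x[0] - m1) ** 2 for x in samples) / n
    k12 = sum((x[0] - m1) * (x[1] - m2) for x in samples) / n
    k22 = sum((x[1] - m2) ** 2 for x in samples) / n
    return k11, k12, k22

rng = random.Random(42)
samples = [correlated_sample(rng) for _ in range(40000)]
print(sample_covariance(samples))  # roughly (3, -1, 3)
```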


Example 5.6-2
(correlated Normal RVs) Jointly Normal RVs $X_1$ and $X_2$ have joint pdf given by (see Equation 4.3-27 and the surrounding discussion in Section 4.3)

$$f_{X_1X_2}(x_1, x_2) = \frac{1}{2\pi\sigma^2\sqrt{1 - \rho^2}}\exp\left(-\frac{1}{2\sigma^2(1 - \rho^2)}\left(x_1^2 - 2\rho x_1x_2 + x_2^2\right)\right).$$

Let the correlation coefficient $\rho$ be −0.5. From $X_1$, $X_2$ find two jointly Normal RVs $Y_1$ and $Y_2$ such that $Y_1$ and $Y_2$ are independent. Avoid the trivial case of $Y_1 = Y_2 = 0$.


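Solution Define $\mathbf{x} \triangleq (x_1, x_2)^T$ and $\mathbf{y} \triangleq (y_1, y_2)^T$. Then with $\rho = -0.5$, the quadratic in the exponent can be written as

$$x_1^2 + x_1x_2 + x_2^2 = \mathbf{x}^T\begin{bmatrix} a & b \\ c & d \end{bmatrix}\mathbf{x} = ax_1^2 + (b + c)x_1x_2 + dx_2^2,$$

where the a, b, c, d are to be determined. We immediately find that a = d = 1 and, because of the real symmetric requirement, we find b = c = 0.5. We can rewrite $f_{X_1X_2}(x_1, x_2)$ in standard form as

$$f_{X_1X_2}(x_1, x_2) = \frac{1}{2\pi[\det(\mathbf{K}_{XX})]^{1/2}}\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{K}_{XX}^{-1}\mathbf{x}\right),$$

whence

$$\mathbf{K}_{XX}^{-1} = \frac{1}{\sigma^2(1 - \rho^2)}\begin{bmatrix} a & b \\ c & d \end{bmatrix} = \frac{4}{3\sigma^2}\begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}.$$

Our task is now to find a transformation that diagonalizes $\mathbf{K}_{XX}^{-1}$. This will enable the joint pdf of $Y_1$ and $Y_2$ to be factored, thereby establishing that $Y_1$ and $Y_2$ are independent.

The factor $4/3\sigma^2$ affects the eigenvalues of $\mathbf{K}_{XX}^{-1}$ but not the eigenvectors. To compute a set of orthonormal eigenvectors of $\mathbf{K}_{XX}^{-1}$ we need only consider $\tilde{\mathbf{K}}_{XX}^{-1}$ given by

$$\tilde{\mathbf{K}}_{XX}^{-1} \triangleq \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix},$$

for which we obtain $\lambda_1 = 3/2$, $\lambda_2 = 1/2$. The corresponding unit eigenvectors are $\boldsymbol{\phi}_1 = (1/\sqrt{2})(1, 1)^T$ and $\boldsymbol{\phi}_2 = (1/\sqrt{2})(1, -1)^T$. Thus with

$$\tilde{\mathbf{U}} \triangleq \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$

(normalization by $1/\sqrt{2}$ is not needed to obtain a diagonal covariance matrix so we dispense with these factors) we find that

$$\tilde{\mathbf{U}}^T\tilde{\mathbf{K}}_{XX}^{-1}\tilde{\mathbf{U}} = \mathrm{diag}(3, 1).$$

Hence a transformation that will work is†

$$\mathbf{Y} = \tilde{\mathbf{U}}^T\mathbf{X},$$

that is,

$$Y_1 = X_1 + X_2$$
$$Y_2 = X_1 - X_2.$$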


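To find $f_{Y_1Y_2}(y_1, y_2)$ we use Equation 3.4-21 of Chapter 3:

$$f_{Y_1Y_2}(y_1, y_2) = \sum_{i=1}^{n} f_{X_1X_2}(\mathbf{x}_i)/|J_i|,$$

where the $\mathbf{x}_i \triangleq (x_1^{(i)}, x_2^{(i)})^T$, $i = 1, \ldots, n$, are the n solutions to $\mathbf{y} - \tilde{\mathbf{U}}^T\mathbf{x} = \mathbf{0}$ and $J_i$ is the Jacobian. There is only one solution (n = 1) to $\mathbf{y} - \tilde{\mathbf{U}}^T\mathbf{x} = \mathbf{0}$, which is

$$x_1 = \frac{y_1 + y_2}{2}, \quad x_2 = \frac{y_1 - y_2}{2}$$

and, dispensing with subscripts there being only one root,

$$J = \det\left(\frac{\partial\mathbf{g}}{\partial\mathbf{x}}\right) = \det\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} = -2.$$

Hence

$$f_{Y_1Y_2}(y_1, y_2) = \frac{1}{2}f_{X_1X_2}\left(\frac{y_1 + y_2}{2}, \frac{y_1 - y_2}{2}\right) = \left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{y_1^2}{2\sigma^2}\right)\right]\left[\frac{1}{\sqrt{2\pi}\,\sigma'}\exp\left(-\frac{y_2^2}{2\sigma'^2}\right)\right],$$

where $\sigma' \triangleq \sqrt{3}\,\sigma$.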
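The variances of $Y_1$ and $Y_2$ can be cross-checked by transforming the covariance matrix itself, using $\mathbf{K}_{YY} = \mathbf{A}\mathbf{K}_{XX}\mathbf{A}^T$ from Theorem 5.6-1 with $\mathbf{A}$ equal to the transposed, unnormalized eigenvector matrix. The sketch below takes $\sigma^2 = 1$ as an illustrative assumption and uses exact rational arithmetic; the off-diagonal terms vanish and the diagonal is $(\sigma^2, 3\sigma^2)$.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

rho = Fraction(-1, 2)
sigma2 = Fraction(1)              # assumed value of sigma^2 for the illustration
K_xx = [[sigma2, rho * sigma2],
        [rho * sigma2, sigma2]]
U = [[1, 1],
     [1, -1]]                     # unnormalized eigenvector matrix
U_T = [list(row) for row in zip(*U)]

# Covariance of Y = U^T X.
K_yy = matmul(matmul(U_T, K_xx), U)
print(K_yy)
```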





†There is no requirement to whiten the covariance matrix as in Example 5.6-1. Also, diagonalizing $\mathbf{K}_{XX}^{-1}$ is equivalent to diagonalizing $\mathbf{K}_{XX}$.
Examples 5.6-1 and 5.6-2 are special cases of the following theorem:

Theorem 5.6-3 Let $\mathbf{X}$ be a Normal, zero-mean (for convenience) random vector with positive definite covariance matrix $\mathbf{K}_{XX}$. Then there exists a nonsingular n × n matrix $\mathbf{C}$ such that under the transformation

$$\mathbf{Y} = \mathbf{C}^{-1}\mathbf{X},$$

the components $Y_1, \ldots, Y_n$ of $\mathbf{Y}$ are independent and of unit variance.

Proof Let $\mathbf{C}^{-1} = \boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$; then $\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T$. ∎


Example 5.6-3 
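(generalized Rayleigh law) Let $\mathbf{X} = (X_1, X_2, X_3)^T$ be a zero-mean Normal random vector with covariance matrix

$$\mathbf{K}_{XX} = \sigma^2\mathbf{I}.$$

Compute the pdf of $R_3 \triangleq \|\mathbf{X}\| = \sqrt{X_1^2 + X_2^2 + X_3^2}$.

Solution The probability of the event $\{R_3 \le r\}$ is the CDF $F_{R_3}(r)$ of $R_3$. Thus,

$$F_{R_3}(r) = \frac{1}{(2\pi)^{3/2}[\sigma^2]^{3/2}}\iiint_{\mathcal{G}}\exp\left[-\frac{1}{2\sigma^2}\left(x_1^2 + x_2^2 + x_3^2\right)\right]dx_1\,dx_2\,dx_3,$$

where $\mathcal{G} \triangleq \{(x_1, x_2, x_3): \sqrt{x_1^2 + x_2^2 + x_3^2} \le r\}$. Now let

$$x_1 \triangleq \xi\cos\phi$$
$$x_2 \triangleq \xi\sin\phi\cos\theta$$
$$x_3 \triangleq \xi\sin\phi\sin\theta,$$

that is, a rectangular-to-spherical coordinate transformation. The Jacobian of this transformation is $\xi^2\sin\phi$. Using this transformation in the expression for $F_{R_3}(r)$, we obtain for $r > 0$

$$\begin{aligned} F_{R_3}(r) &= \frac{1}{(2\pi)^{3/2}[\sigma^2]^{3/2}}\int_{\phi=0}^{\pi}\int_{\theta=0}^{2\pi}\int_{\xi=0}^{r}\exp\left[-\frac{\xi^2}{2\sigma^2}\right]\xi^2\sin\phi\,d\xi\,d\theta\,d\phi \\ &= \frac{4\pi}{(2\pi)^{3/2}[\sigma^2]^{3/2}}\int_0^r\xi^2\exp\left[-\frac{\xi^2}{2\sigma^2}\right]d\xi. \end{aligned}$$

To obtain $f_{R_3}(r)$, we differentiate $F_{R_3}(r)$ with respect to $r$. This yields

$$f_{R_3}(r) = \frac{2r^2}{\Gamma(3/2)[2\sigma^2]^{3/2}}\exp\left(-\frac{r^2}{2\sigma^2}\right)u(r), \tag{5.6-22}$$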
where $u(r)$ is the unit step and $\Gamma(3/2) = \sqrt{\pi}/2$. Equation 5.6-22 is an extension of the ordinary two-dimensional Rayleigh introduced in Chapter 2. The general n-dimensional Rayleigh is the pdf associated with $R_n \triangleq \|\mathbf{X}\| = \sqrt{X_1^2 + \cdots + X_n^2}$ and is given by

$$f_{R_n}(r) = \frac{2r^{n-1}}{\Gamma\left(\frac{n}{2}\right)[2\sigma^2]^{n/2}}\exp\left(-\frac{r^2}{2\sigma^2}\right)u(r). \tag{5.6-23}$$
The proof of Equation 5.6-23 requires the use of n-dimensional spherical coordinates. Such 
generalized spherical coordinates are well known in the mathematical literature [5-5, p. 9]. 
The demonstration of Equation 5.6-23 is left as a challenging problem. 
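Short of a proof, Equation 5.6-23 can at least be checked numerically: it should integrate to one and reduce to the familiar two-dimensional Rayleigh pdf $(r/\sigma^2)\exp(-r^2/2\sigma^2)$ for n = 2. The sketch below does both; the function names, the integration range of twelve standard deviations, and the step count are illustrative assumptions.

```python
import math

def rayleigh_n_pdf(r, n, sigma):
    """Generalized Rayleigh pdf of Equation 5.6-23 (zero for r < 0)."""
    if r < 0:
        return 0.0
    coeff = 2.0 / (math.gamma(n / 2.0) * (2.0 * sigma ** 2) ** (n / 2.0))
    return coeff * r ** (n - 1) * math.exp(-r * r / (2.0 * sigma ** 2))

def total_probability(n, sigma, steps=20000):
    """Midpoint-rule integral of the pdf over [0, 12 sigma]."""
    upper = 12.0 * sigma
    h = upper / steps
    return h * sum(rayleigh_n_pdf((k + 0.5) * h, n, sigma) for k in range(steps))

for n in (2, 3, 5):
    print(n, total_probability(n, sigma=1.5))
```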








5.7 CHARACTERISTIC FUNCTIONS OF RANDOM VECTORS 
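In Equation 4.7-1 we defined the CF of a random variable as

$$\Phi_X(\omega) \triangleq E[e^{j\omega X}].$$

The extension to random vectors is straightforward. Let $\mathbf{X} = (X_1, \ldots, X_n)^T$ be a real n-component random vector. Let $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_n)^T$ be a real n-component parameter vector. The CF of $\mathbf{X}$ is defined as

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) \triangleq E[e^{j\boldsymbol{\omega}^T\mathbf{X}}]. \tag{5.7-1}$$

The similarity to the scalar case is obvious. In the case of continuous random vectors, the actual evaluation of Equation 5.7-1 is done through

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{x})e^{j\boldsymbol{\omega}^T\mathbf{x}}\,d\mathbf{x}. \tag{5.7-2}$$

In Equation 5.7-2 we use the usual compact notation that $d\mathbf{x} = dx_1 \ldots dx_n$ and the integral sign refers to an n-fold integration. If $\mathbf{X}$ is a discrete random vector, $\Phi_{\mathbf{X}}(\boldsymbol{\omega})$ can be computed from the joint PMF as

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \sum_{-\infty}^{+\infty}\cdots\sum_{-\infty}^{+\infty} P_{\mathbf{X}}(\mathbf{x})e^{j\boldsymbol{\omega}^T\mathbf{x}}, \tag{5.7-3}$$

where the summation sign refers to an n-fold summation.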

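In both cases, we see that $\Phi_{\mathbf{X}}(\boldsymbol{\omega})$ is, except for a sign reversal in the exponent, the n-dimensional Fourier transform of $f_{\mathbf{X}}(\mathbf{x})$ or $P_{\mathbf{X}}(\mathbf{x})$. This being the case, we can recover for example the pdf by the inverse n-dimensional Fourier transform (again with a sign reversal). Thus,

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^n}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\Phi_{\mathbf{X}}(\boldsymbol{\omega})e^{-j\boldsymbol{\omega}^T\mathbf{x}}\,d\boldsymbol{\omega}. \tag{5.7-4}$$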


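The CF is very useful for computing joint moments. We illustrate with an example. 

Example 5.7-1 
(finding mixed moment) Let $\mathbf{X} \triangleq (X_1, X_2, X_3)^T$ and $\boldsymbol{\omega} \triangleq (\omega_1, \omega_2, \omega_3)^T$. Compute 
$E[X_1X_2X_3]$. 

Solution Since 

$$\Phi_{\mathbf{X}}(\omega_1, \omega_2, \omega_3) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{\mathbf{X}}(x_1, x_2, x_3) \, e^{j[\omega_1 x_1 + \omega_2 x_2 + \omega_3 x_3]} \, dx_1 \, dx_2 \, dx_3,$$

we obtain by partial differentiation 

$$\begin{aligned} \frac{1}{j^3} \left. \frac{\partial^3 \Phi_{\mathbf{X}}(\omega_1, \omega_2, \omega_3)}{\partial\omega_1 \, \partial\omega_2 \, \partial\omega_3} \right|_{\omega_1 = \omega_2 = \omega_3 = 0} &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_1 x_2 x_3 \, f_{\mathbf{X}}(x_1, x_2, x_3) \, dx_1 \, dx_2 \, dx_3 \\ &\triangleq E[X_1X_2X_3]. \end{aligned}$$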








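Any moment, provided that it exists, can be computed by the method used in 
Example 5.7-1, that is, by partial differentiation. Thus, 

$$E[X_1^{k_1} \cdots X_n^{k_n}] = j^{-(k_1 + \cdots + k_n)} \left. \frac{\partial^{k_1 + \cdots + k_n} \Phi_{\mathbf{X}}(\omega_1, \ldots, \omega_n)}{\partial\omega_1^{k_1} \cdots \partial\omega_n^{k_n}} \right|_{\omega_1 = \cdots = \omega_n = 0}. \tag{5.7-5}$$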
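
The differentiation rule can be tried numerically. The sketch below is only an illustration under stated assumptions: it takes three independent Poisson RVs (so the joint CF is the product of three Poisson CFs and the exact answer is the product of the three parameters), and it replaces the partial derivatives by central differences. The parameter values, the step size `h`, and the function names are hypothetical.

```python
import cmath

LAMBDAS = (0.5, 1.2, 2.0)  # hypothetical Poisson parameters for X1, X2, X3

def joint_cf(w1, w2, w3):
    """Joint CF of three independent Poisson RVs: product of the marginal CFs."""
    result = 1.0 + 0.0j
    for lam, w in zip(LAMBDAS, (w1, w2, w3)):
        result *= cmath.exp(lam * (cmath.exp(1j * w) - 1.0))
    return result

def mixed_moment(cf, h=1e-2):
    """Estimate E[X1 X2 X3] = j^(-3) d^3 Phi / (dw1 dw2 dw3) at the origin,
    using a central difference in each of the three coordinates."""
    total = 0.0 + 0.0j
    for s1 in (1, -1):
        for s2 in (1, -1):
            for s3 in (1, -1):
                total += s1 * s2 * s3 * cf(s1 * h, s2 * h, s3 * h)
    derivative = total / (2 * h) ** 3
    return derivative / (1j) ** 3

estimate = mixed_moment(joint_cf)
exact = LAMBDAS[0] * LAMBDAS[1] * LAMBDAS[2]  # independence: product of the means
print(f"finite-difference estimate = {estimate.real:.5f}, exact = {exact:.5f}")
```

The finite-difference error shrinks roughly as the square of `h`, so the estimate agrees with the exact product only approximately.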


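By writing 

$$E[\exp(j\boldsymbol{\omega}^T \mathbf{X})] = E\left[\exp\left(j \sum_{i=1}^{n} \omega_i X_i\right)\right] = E\left[\prod_{i=1}^{n} \exp(j\omega_i X_i)\right]$$

and expanding each term in the product into a power series, we readily obtain the rather 
cumbersome formula 

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \sum_{k_1=0}^{\infty} \cdots \sum_{k_n=0}^{\infty} E[X_1^{k_1} \cdots X_n^{k_n}] \, \frac{(j\omega_1)^{k_1}}{k_1!} \cdots \frac{(j\omega_n)^{k_n}}{k_n!}, \tag{5.7-6}$$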





which has the advantage of explicitly revealing the relationship between the joint CF and 
the joint moments of the $X_i$, $i = 1, \ldots, n$. Of course Equation 5.7-6 has meaning only if 

$$E[X_1^{k_1} \cdots X_n^{k_n}]$$

exists for all values of the nonnegative integers $k_1, \ldots, k_n$, and when the power series 
converge. 

From Equation 5.7-2 observe the important CF properties: 

Properties of CF of Random Vectors 

1. $|\Phi_{\mathbf{X}}(\boldsymbol{\omega})| \le \Phi_{\mathbf{X}}(\mathbf{0}) = 1$ and 
2. $\Phi_{\mathbf{X}}^*(\boldsymbol{\omega}) = \Phi_{\mathbf{X}}(-\boldsymbol{\omega})$ (* indicates conjugation). 
3. All CFs of subsets of the components of $\mathbf{X}$ can be obtained once $\Phi_{\mathbf{X}}(\boldsymbol{\omega})$ is known. 

The last property is readily demonstrated with the following example. Suppose 
$\mathbf{X} = (X_1, X_2, X_3)^T$ has CF† $\Phi_{\mathbf{X}}(\omega_1, \omega_2, \omega_3) = E[\exp j(\omega_1 X_1 + \omega_2 X_2 + \omega_3 X_3)]$. Then 

$$\begin{aligned} \Phi_{X_1X_2}(\omega_1, \omega_2) &= \Phi_{X_1X_2X_3}(\omega_1, \omega_2, 0) \\ \Phi_{X_1X_3}(\omega_1, \omega_3) &= \Phi_{X_1X_2X_3}(\omega_1, 0, \omega_3) \\ \Phi_{X_1}(\omega_1) &= \Phi_{X_1X_2X_3}(\omega_1, 0, 0). \end{aligned}$$


As pointed out in Chapter 4, CFs are also useful in solving problems involving sums of 
independent RVs. Thus, suppose $\mathbf{X} = (X_1, \ldots, X_n)^T$, where the $X_i$ are independent RVs 
with marginal pdf’s $f_{X_i}(x_i)$, $i = 1, \ldots, n$. The pdf of the sum 

$$Z = X_1 + \cdots + X_n$$

can be obtained from 

$$f_Z(z) = f_{X_1}(z) * \cdots * f_{X_n}(z). \tag{5.7-7}$$

However, the actual carrying out of the n-fold convolution in Equation 5.7-7 can be quite 
tedious. The computation of $f_Z(z)$ can be done more advantageously using CFs as follows. 
We have 

$$\begin{aligned} \Phi_Z(\omega) &= E[e^{j\omega(X_1 + \cdots + X_n)}] \\ &= \prod_{i=1}^{n} E[e^{j\omega X_i}] \\ &= \prod_{i=1}^{n} \Phi_{X_i}(\omega). \end{aligned} \tag{5.7-8}$$

In this development, line 2 follows from the fact that if $X_1, \ldots, X_n$ are n independent 
RVs, then $Y_i = g_i(X_i)$, $i = 1, \ldots, n$, will also be n independent RVs and $E[Y_1 \cdots Y_n] = 
E[Y_1] \cdots E[Y_n]$. The inverse Fourier transform of Equation 5.7-8 yields the pdf $f_Z(z)$. This 
approach works equally well when the $X_i$ are discrete. Then the PMF and the discrete 
Fourier transform can be used. We illustrate this approach to computing the pdf’s of sums 
of RVs with an example. 


†We use $\Phi_{\mathbf{X}}(\cdot)$ and $\Phi_{X_1X_2X_3}(\cdot)$ interchangeably if $\mathbf{X} = (X_1, X_2, X_3)^T$. 

Example 5.7-2 
(i.i.d. Poisson CF) Let $\mathbf{X} = (X_1, \ldots, X_n)^T$, where the $X_i$, $i = 1, \ldots, n$, are i.i.d. Poisson 
RVs with Poisson parameter $\lambda$. Let $Z = X_1 + \cdots + X_n$. Then the individual PMFs are 

$$P_{X_i}(k) = \frac{\lambda^k}{k!} e^{-\lambda} \tag{5.7-9}$$

and 

$$\Phi_{X_i}(\omega) = \sum_{k=0}^{\infty} \frac{\lambda^k \exp(j\omega k)}{k!} e^{-\lambda} = e^{\lambda(\exp(j\omega) - 1)}. \tag{5.7-10}$$

Hence, by independence we obtain 

$$\Phi_Z(\omega) = \prod_{i=1}^{n} e^{\lambda(\exp(j\omega) - 1)} = e^{n\lambda(\exp(j\omega) - 1)}. \tag{5.7-11}$$

Comparing Equation 5.7-11 with Equation 5.7-10 we see by inspection that $\Phi_Z(\omega)$ is the 
CF of the PMF 

$$P_Z(k) = \frac{\alpha^k}{k!} e^{-\alpha}, \quad k = 0, 1, \ldots, \tag{5.7-12}$$

where $\alpha \triangleq n\lambda$. Thus, the sum of n i.i.d. Poisson RVs is Poisson with parameter $n\lambda$. 
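
The conclusion of Example 5.7-2 can be cross-checked by carrying out the convolution of Equation 5.7-7 directly on the PMFs. The following is a minimal sketch under stated assumptions: the values of `lam`, `n`, and the truncation point `kmax` are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math

def poisson_pmf(lam, kmax):
    """Poisson PMF values P(k) for k = 0..kmax (Equation 5.7-9)."""
    return [lam**k * math.exp(-lam) / math.factorial(k) for k in range(kmax + 1)]

def convolve(p, q, kmax):
    """Discrete convolution of two PMFs supported on 0, 1, 2, ..., kept for k <= kmax.
    Entry k only needs entries 0..k of p and q, so truncation does not alter it."""
    return [sum(p[i] * q[k - i] for i in range(k + 1)) for k in range(kmax + 1)]

lam, n, kmax = 0.7, 5, 40  # illustrative values, not taken from the text
single = poisson_pmf(lam, kmax)

pmf_sum = single
for _ in range(n - 1):          # combine the n individual PMFs one at a time
    pmf_sum = convolve(pmf_sum, single, kmax)

direct = poisson_pmf(n * lam, kmax)   # Equation 5.7-12 with alpha = n * lam
max_diff = max(abs(a - b) for a, b in zip(pmf_sum, direct))
print(f"largest discrepancy for k = 0..{kmax}: {max_diff:.2e}")
```

The repeated convolution is exactly the tedious computation that the CF argument avoids; here it serves only as a check.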


The Characteristic Function of the Gaussian (Normal) Law 


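Let $\mathbf{X}$ be a real Gaussian (Normal) random vector with nonsingular covariance matrix $\mathbf{K}_{XX}$. 
Then from Theorem 5.6-3 both $\mathbf{K}_{XX}$ and $\mathbf{K}_{XX}^{-1}$ can be factored as 

$$\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T \tag{5.7-13}$$

$$\mathbf{K}_{XX}^{-1} = \mathbf{D}\mathbf{D}^T, \quad \mathbf{D} \triangleq [\mathbf{C}^T]^{-1}, \tag{5.7-14}$$

where $\mathbf{C}$ and $\mathbf{D}$ are nonsingular. This observation will be put to good use shortly. The CF 
of $\mathbf{X}$ is by definition 

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \frac{1}{(2\pi)^{n/2} [\det(\mathbf{K}_{XX})]^{1/2}} \int_{-\infty}^{\infty} \exp\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{K}_{XX}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) \cdot \exp(j\boldsymbol{\omega}^T \mathbf{x}) \, d\mathbf{x}. \tag{5.7-15}$$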
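Now introduce the transformation 

$$\mathbf{z} \triangleq \mathbf{D}^T (\mathbf{x} - \boldsymbol{\mu}) \tag{5.7-16}$$

so that 

$$\begin{aligned} \mathbf{z}^T \mathbf{z} &= (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{D}\mathbf{D}^T (\mathbf{x} - \boldsymbol{\mu}) \\ &= (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{K}_{XX}^{-1} (\mathbf{x} - \boldsymbol{\mu}). \end{aligned} \tag{5.7-17}$$

The Jacobian of this transformation is $\det(\mathbf{D}^T) = \det(\mathbf{D})$. Thus under the transformation 
in Equation 5.7-16, Equation 5.7-15 becomes 

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \frac{\exp(j\boldsymbol{\omega}^T \boldsymbol{\mu})}{(2\pi)^{n/2} [\det(\mathbf{K}_{XX})]^{1/2} \, |\det(\mathbf{D})|} \int_{-\infty}^{\infty} \exp\left(-\tfrac{1}{2} \mathbf{z}^T \mathbf{z}\right) \exp\left(j\boldsymbol{\omega}^T (\mathbf{D}^T)^{-1} \mathbf{z}\right) d\mathbf{z}. \tag{5.7-18}$$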
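We can complete the squares in the integrand as follows: 

$$\exp\left[-\left\{\tfrac{1}{2}\left[\mathbf{z}^T \mathbf{z} - 2j\boldsymbol{\omega}^T (\mathbf{D}^T)^{-1} \mathbf{z}\right]\right\}\right] = \exp\left(-\tfrac{1}{2} \boldsymbol{\omega}^T (\mathbf{D}^T)^{-1} \mathbf{D}^{-1} \boldsymbol{\omega}\right) \cdot \exp\left(-\tfrac{1}{2} \|\mathbf{z} - j\mathbf{D}^{-1} \boldsymbol{\omega}\|^2\right). \tag{5.7-19}$$

Equations 5.7-18 and 5.7-19 will be greatly simplified if we use the following results: (a) If 
$\mathbf{K}_{XX}^{-1} = \mathbf{D}\mathbf{D}^T$, then $\mathbf{K}_{XX} = [\mathbf{D}^T]^{-1} \mathbf{D}^{-1}$; (b) $\det(\mathbf{K}_{XX}^{-1}) = \det(\mathbf{D}) \det(\mathbf{D}^T) = [\det(\mathbf{D})]^2 = 
[\det(\mathbf{K}_{XX})]^{-1}$. Hence $|\det(\mathbf{D})|^{-1} = [\det(\mathbf{K}_{XX})]^{1/2}$. It then follows that 

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \exp\left(j\boldsymbol{\omega}^T \boldsymbol{\mu} - \tfrac{1}{2} \boldsymbol{\omega}^T \mathbf{K}_{XX} \boldsymbol{\omega}\right) \cdot \frac{1}{(2\pi)^{n/2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2} \|\mathbf{z} - j\mathbf{D}^{-1} \boldsymbol{\omega}\|^2} \, d\mathbf{z}.$$

Finally we recognize that the n-fold integral on the right-hand side is the product of n identical 
integrals of one-dimensional Gaussian densities, each of unit variance. Hence the value 
of the integral is merely $(2\pi)^{n/2}$, which cancels the factor $(2\pi)^{-n/2}$ and yields the CF for 
the Normal random vector: 

$$\Phi_{\mathbf{X}}(\boldsymbol{\omega}) = \exp\left[j\boldsymbol{\omega}^T \boldsymbol{\mu} - \tfrac{1}{2} \boldsymbol{\omega}^T \mathbf{K}_{XX} \boldsymbol{\omega}\right], \tag{5.7-20}$$

where $\boldsymbol{\mu}$ is the mean vector, $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_n)^T$, and $\mathbf{K}_{XX}$ is the covariance. We observe in 
passing that $\Phi_{\mathbf{X}}(\boldsymbol{\omega})$ has a multidimensional complex Gaussian form as a function of $\boldsymbol{\omega}$. Thus, 
the Gaussian pdf has mapped into a Gaussian CF, a result that should not be too surprising 
since we already know that the one-dimensional Fourier transform maps a Gaussian function 
into a Gaussian function. 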
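
Equation 5.7-20 can be compared against a sample average of $\exp(j\boldsymbol{\omega}^T \mathbf{X})$. The sketch below is a Monte Carlo illustration for n = 2 only; the mean vector, covariance matrix, test point `w`, trial count, and seed are all hypothetical choices, and the samples are generated from the factorization $\mathbf{K}_{XX} = \mathbf{C}\mathbf{C}^T$ of Equation 5.7-13 using a hand-written 2 x 2 Cholesky factor.

```python
import cmath
import math
import random

mu = (1.0, -0.5)               # illustrative mean vector
K = ((2.0, 0.6), (0.6, 1.0))   # illustrative covariance matrix (positive definite)

# Lower-triangular factor C with K = C C^T, written out for the 2 x 2 case.
c11 = math.sqrt(K[0][0])
c21 = K[1][0] / c11
c22 = math.sqrt(K[1][1] - c21**2)

def sample(rng):
    """Draw X = mu + C Z, where Z holds two independent standard Normal RVs."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return mu[0] + c11 * z1, mu[1] + c21 * z1 + c22 * z2

def cf_formula(w):
    """Equation 5.7-20: exp(j w^T mu - (1/2) w^T K w)."""
    lin = w[0] * mu[0] + w[1] * mu[1]
    quad = sum(w[a] * K[a][b] * w[b] for a in range(2) for b in range(2))
    return cmath.exp(1j * lin - 0.5 * quad)

def cf_monte_carlo(w, trials=50000, seed=1):
    """Sample average of exp(j w^T X) over independent draws of X."""
    rng = random.Random(seed)
    total = 0.0 + 0.0j
    for _ in range(trials):
        x1, x2 = sample(rng)
        total += cmath.exp(1j * (w[0] * x1 + w[1] * x2))
    return total / trials

w = (0.4, -0.7)
error = abs(cf_monte_carlo(w) - cf_formula(w))
print(f"|Monte Carlo CF - Equation 5.7-20| = {error:.4f}")
```

The discrepancy is sampling noise on the order of one over the square root of the number of trials, so agreement is approximate rather than exact.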

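Similarly the joint MGF for a random vector $\mathbf{X} = (X_1, \ldots, X_N)^T$ is defined as 

$$M_{\mathbf{X}}(\mathbf{t}) \triangleq E\left[\exp\left(\sum_{i=1}^{N} t_i X_i\right)\right] = \sum_{k_1=0}^{\infty} \sum_{k_2=0}^{\infty} \cdots \sum_{k_N=0}^{\infty} \frac{t_1^{k_1} \cdots t_N^{k_N}}{k_1! \cdots k_N!} \, E[X_1^{k_1} \cdots X_N^{k_N}],$$

from which joint moments can be computed analogously to the CF case. 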





SUMMARY 


In this chapter we studied the calculus of multiple RVs. We found it convenient to organize 
multiple RVs into random vectors and treat these as single entities. We found that when i.i.d. 
random variables are ordered, many probabilistic results can be derived without specifying 
the underlying distributions. In Section 5.3, we derived, among others, the distribution of 
probability area (the area under the pdf between order samples) and the moments of such 
probability areas. We shall see in subsequent chapters that ordered random variables play 
important roles in a branch of statistics called distribution-free or robust statistics. Because 
in practice it is often difficult to describe the joint probability law of n RVs, we argued 
that in the case of random vectors we often settle for a less complete but more available 
characterization than that furnished by the pdf (PMF). We focused on the characterizations 
furnished by the lower order moments, especially the mean and covariance. In particular, 
because of the great importance of covariance matrices in signal processing, communication 
theory, pattern recognition, multiple regression analysis, and other areas of engineering and 
science, we made use of numerous results from matrix theory and linear algebra to reveal 
the properties of these matrices. 

We discussed the multidimensional Gaussian (Normal) law and CFs of random vectors. 
We demonstrated that under linear transformations Gaussian random vectors map into 
Gaussian random vectors. We showed how to derive a transformation that can convert 
correlated RVs into uncorrelated ones. The CF of random vectors in general was defined 
and shown to be useful in computing moments and solving problems involving the sums of 
independent RVs; these assertions were illustrated with examples. Finally, using vector and 
matrix techniques we derived the CF for the Gaussian random vector and showed that it 
too had a Gaussian shape. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 


5.1 Let $X_1$, $X_2$, and $X_3$ be independent standard normal random variables. Let 
$$Y_1 = X_1 + X_2 + X_3$$
$$Y_2 = X_1 - X_2$$
$$Y_3 = X_2 - X_3$$
Determine the joint pdf of $Y_1, Y_2, Y_3$. 

5.2 Let $B_i$, $i = 1, \ldots, n$, be n disjoint and exhaustive events. Show that the CDF of X 
can be written as 

$$F_X(x) = \sum_{i=1}^{n} F_{X|B_i}(x|B_i) P[B_i].$$
5.3 Two Gaussian random variables $X_1$ and $X_2$ have zero means and variances $\sigma_{X_1}^2 = 4$ 
and $\sigma_{X_2}^2 = 9$. Their covariance is $K_{X_1X_2} = 3$. If $X_1$ and $X_2$ are linearly transformed 
to new variables $Y_1$ and $Y_2$ according to $Y_1 = X_1 - 2X_2$ and $Y_2 = 3X_1 + 4X_2$, find 
the means, variances and covariance of $Y_1$ and $Y_2$. 

5.4 Let $X_1, X_2, X_3$ be three standard Normal RVs. For $i = 1, 2, 3$ let $Y_i \in \{X_1, X_2, X_3\}$ 
such that $Y_1 < Y_2 < Y_3$, i.e., the $X_i$ ordered by signed magnitude. Compute 
the joint pdf $f_{Y_1Y_2Y_3}(y_1, y_2, y_3)$. 

5.5 In Problem 5.4 compute the CDF $F_{Y_i}(y)$, for $i = 1, 2, 3$, and plot the result. 

5.6 In Section 5.4 we introduced the RVs $Z_1$ and $Z_n$. Show that the joint pdf of $Z_1$ and 
$Z_n$ is given by Equation 5.3-7. 

5.7 Consider the RVs $V_{1n} \triangleq Z_n - Z_1$, $W = Z_n$. Show that the joint pdf $f_{V_{1n}W}(v, w) = 
n(n - 1)v^{n-2}$, for $0 < w - v < w < 1$ and zero else. 

5.8 From the results of the previous problem, show that $f_{V_{1n}}(v) = n(n - 1)v^{n-2}(1 - v)$, 
for $0 < v < 1$, $n \ge 2$ and zero else. 

5.9 Show that the area under $f_{Z_1Z_2Z_3}(z_1, z_2, z_3) = 3!$ with $0 < z_1 < z_2 < z_3 < 1$ is unity. 

5.10 Compute the beta CDF for n = 2, β = 0; n = 2, β = 0. 

5.11 Derive Equations 5.3-11, 5.3-12, 5.3-13. 

5.12 Use Excel or a similar computer program to generate curves of the beta CDF for 
n = 15, 20, 30. Describe what seems to be happening as $n \to \infty$. 

5.13 Derive Equation 5.3-14. 

5.14 Show that, on the average, n ordered random variables divide the total area under 
$f_X(x)$ into n + 1 equal parts. 

5.15 Show that any matrix $\mathbf{M}$ generated by an outer product of two vectors, that is, 
$\mathbf{M} = \mathbf{X}\mathbf{X}^T$, has rank at most unity. Explain why $\mathbf{R} \triangleq E[\mathbf{X}\mathbf{X}^T]$ can be of full rank. 

5.16 Let $\{X_i, i = 1, \ldots, n\}$ be n i.i.d. observations on X and let $\{Y_i, i = 1, \ldots, n\}$ be the 
associated order statistics. Show that $F_{Y_n}(y) = F_X^n(y)$. 

5.17 Let $\{X_i, i = 1, \ldots, n\}$ be n i.i.d. observations on X and let $\{Y_i, i = 1, \ldots, n\}$ be the 
associated order statistics. Show that $F_{Y_1}(y) = 1 - (1 - F_X(y))^n$. 

5.18 Let $X = \cos\Theta$ and $Y = \sin\Theta$ where $\Theta$ is uniformly distributed in $(0, 2\pi)$. Determine 
whether X and Y are independent or not. Verify whether X and Y are uncorrelated. 

5.19 Show that the two RVs $X_1$ and $X_2$ with joint pdf 

$$f_{X_1X_2}(x_1, x_2) = \begin{cases} \frac{1}{16}, & |x_1| < 4, \; 2 < x_2 < 4 \\ 0, & \text{otherwise} \end{cases}$$

are independent and orthogonal. 

5.20 Let $\mathbf{X} = (X_1, X_2)^T$ consist of two unit-variance uncorrelated random variables. Find 
the matrix $\mathbf{A}$ such that $\mathbf{Y} = \mathbf{A}\mathbf{X}$ has the covariance matrix 

$$\mathbf{K} = \sigma^2 \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \quad \text{where } |\rho| < 1.$$

5.21 Two random variables X and Y have the joint characteristic function $\Phi_{XY}(\omega_1, \omega_2) = 
\exp[-2\omega_1^2 - 8\omega_2^2]$. 
Show that X and Y are both zero-mean random variables and that they are uncorrelated. 

5.22 Let $\mathbf{X}_i$, $i = 1, \ldots, n$, be n mutually uncorrelated random vectors with $E[\mathbf{X}_i] = \boldsymbol{\mu}_i$, 
$i = 1, \ldots, n$. Show that 

$$E\left[\sum_{i=1}^{n} (\mathbf{X}_i - \boldsymbol{\mu}_i) \sum_{j=1}^{n} (\mathbf{X}_j - \boldsymbol{\mu}_j)^T\right] = \sum_{i=1}^{n} \mathbf{K}_i,$$

where $\mathbf{K}_i \triangleq E[(\mathbf{X}_i - \boldsymbol{\mu}_i)(\mathbf{X}_i - \boldsymbol{\mu}_i)^T]$. 
5.23 The vector of random variables (X, Y, Z) is jointly Gaussian with zero means and 
the covariance matrix 

$$\mathbf{K} = \begin{bmatrix} 1 & 0.2 & 0.3 \\ 0.2 & 1 & 0.4 \\ 0.3 & 0.4 & 1 \end{bmatrix}.$$

Find the bivariate density of (X, Y). 
5.24 (a) Let a vector X have E[X] = 0 with covariance Kxx given by 
3 v2 
rala 2 
Find a linear transformation C such that Y = CX will have 
1 0 

Kyy = È | . 


Is C a unitary transformation? 
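(b) Consider the two real symmetric matrices $\mathbf{A}$ and $\mathbf{A}'$ given by 

$$\mathbf{A} = \begin{bmatrix} a & b \\ b & c \end{bmatrix}, \quad \mathbf{A}' = \begin{bmatrix} a' & b' \\ b' & c' \end{bmatrix}.$$

Show that when $a = c$ and $a' = c'$, the product $\mathbf{A}\mathbf{A}'$ is real symmetric. More 
generally, show that if $\mathbf{A}$ and $\mathbf{A}'$ are any real symmetric matrices, then $\mathbf{A}\mathbf{A}'$ 
will be symmetric if $\mathbf{A}\mathbf{A}' = \mathbf{A}'\mathbf{A}$. 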


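5.25 (K. Fukunaga [5-8, p. 33].) Let $\mathbf{K}_1$ and $\mathbf{K}_2$ be positive definite covariance matrices 
and form 

$$\mathbf{K} = a_1 \mathbf{K}_1 + a_2 \mathbf{K}_2, \quad \text{where } a_1, a_2 > 0.$$

Let $\mathbf{A}$ be a transformation that achieves 

$$\mathbf{A}^T \mathbf{K} \mathbf{A} = \mathbf{I}, \quad \mathbf{A}^T \mathbf{K}_1 \mathbf{A} = \boldsymbol{\Lambda}^{(1)} = \mathrm{diag}(\lambda_1^{(1)}, \ldots, \lambda_n^{(1)}).$$

(a) Show that $\mathbf{A}$ satisfies 
$$\mathbf{K}^{-1} \mathbf{K}_1 \mathbf{A} = \mathbf{A} \boldsymbol{\Lambda}^{(1)}.$$

(b) Show that $\mathbf{A}^T \mathbf{K}_2 \mathbf{A} \triangleq \boldsymbol{\Lambda}^{(2)}$ is also diagonal, that is, $\boldsymbol{\Lambda}^{(2)} \triangleq \mathrm{diag}(\lambda_1^{(2)}, \ldots, 
\lambda_n^{(2)})$. 

(c) Show that $\mathbf{A}^T \mathbf{K}_1 \mathbf{A}$ and $\mathbf{A}^T \mathbf{K}_2 \mathbf{A}$ share the same eigenvectors. 

(d) Show that the eigenvalues of $\boldsymbol{\Lambda}^{(2)}$ are related to the eigenvalues of $\boldsymbol{\Lambda}^{(1)}$ as 

$$\lambda_i^{(2)} = \frac{1}{a_2} \left[1 - a_1 \lambda_i^{(1)}\right]$$

and therefore are in inverse order from those of $\boldsymbol{\Lambda}^{(1)}$. 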


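5.26 (J. A. McLaughlin [5-9].) Consider the m vectors $\mathbf{X}_i = (X_{i1}, \ldots, X_{in})^T$, $i = 1, \ldots, m$, 
where $n > m$. Consider the $n \times n$ matrix $\mathbf{S} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{X}_i \mathbf{X}_i^T$. 

(a) Show that with $\mathbf{W} \triangleq (\mathbf{X}_1 \ldots \mathbf{X}_m)$, $\mathbf{S}$ can be written as 
$$\mathbf{S} = \frac{1}{m} \mathbf{W}\mathbf{W}^T.$$

(b) What is the maximum rank of $\mathbf{S}$? 

(c) Let $\mathbf{S}' \triangleq \frac{1}{m} \mathbf{W}^T \mathbf{W}$. What is the size of $\mathbf{S}'$? Show that the first m nonzero 
eigenvalues of $\mathbf{S}$ can be computed from 

$$\mathbf{S}' \boldsymbol{\Phi} = \boldsymbol{\Phi} \boldsymbol{\Lambda},$$

where $\boldsymbol{\Phi}$ is the eigenvector matrix of $\mathbf{S}'$ and $\boldsymbol{\Lambda}$ is the matrix of eigenvalues. 
What are the relations between the eigenvectors and eigenvalues of $\mathbf{S}$ and $\mathbf{S}'$? 

(d) What is the advantage of computing the eigenvectors from $\mathbf{S}'$ rather than $\mathbf{S}$? 

5.27 (a) Let $\mathbf{K}$ be an $n \times n$ covariance matrix and let $\Delta\mathbf{K}$ be a real symmetric 
perturbation matrix. Let $\lambda_i$, $i = 1, \ldots, n$, be the eigenvalues of $\mathbf{K}$ and $\boldsymbol{\phi}_i$ 
the associated eigenvectors. Show that the first-order approximation to the 
eigenvalues $\lambda_i'$ of $\mathbf{K} + \Delta\mathbf{K}$ yields 

$$\lambda_i' = \boldsymbol{\phi}_i^T (\mathbf{K} + \Delta\mathbf{K}) \boldsymbol{\phi}_i, \quad i = 1, \ldots, n.$$

(b) Show that the first-order approximation to the eigenvectors is given by 

$$\Delta\boldsymbol{\phi}_i = \sum_{j=1}^{n} b_{ij} \boldsymbol{\phi}_j,$$

where $b_{ij} = \boldsymbol{\phi}_j^T \Delta\mathbf{K} \boldsymbol{\phi}_i / (\lambda_i - \lambda_j)$, $i \ne j$, and $b_{ii} = 0$. 

5.28 Let $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$ be the eigenvalues of a real symmetric matrix $\mathbf{M}$. For 
$i \ge 2$, let $\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots, \boldsymbol{\phi}_{i-1}$ be mutually orthogonal unit eigenvectors belonging to 
$\lambda_1, \ldots, \lambda_{i-1}$. Prove that the maximum value of $\mathbf{u}^T \mathbf{M} \mathbf{u}$ subject to $\|\mathbf{u}\| = 1$ and 
$\mathbf{u}^T \boldsymbol{\phi}_1 = \ldots = \mathbf{u}^T \boldsymbol{\phi}_{i-1} = 0$ is $\lambda_i$, that is, $\lambda_i = \max(\mathbf{u}^T \mathbf{M} \mathbf{u})$. 

5.29 Let $\mathbf{X} = (X_1, X_2, X_3)^T$ be a random vector with $\boldsymbol{\mu} \triangleq E[\mathbf{X}]$ given by 

$$\boldsymbol{\mu} = (5, -5, 6)^T$$

and covariance given by 

$$\mathbf{K} = \begin{bmatrix} 5 & 2 & -1 \\ 2 & 5 & 1 \\ -1 & 0 & 4 \end{bmatrix}.$$

Calculate the mean and variance of 

$$Y = \mathbf{A}^T \mathbf{X} + B,$$

where $\mathbf{A} = (2, -1, 2)^T$ and $B = 5$. 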
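5.30 Two jointly Normal RVs $X_1$ and $X_2$ have joint pdf given by 

$$f_{X_1X_2}(x_1, x_2) = \frac{2}{\pi\sqrt{7}} \exp\left[-\frac{8}{7}\left(x_1^2 + \frac{3}{2} x_1 x_2 + x_2^2\right)\right].$$

Find a nontrivial transformation $\mathbf{A}$ in 

$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \mathbf{A} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$

such that $Y_1$ and $Y_2$ are independent. Compute the joint pdf of $Y_1, Y_2$. 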
5.31 Show that if $\mathbf{X} = (X_1, \ldots, X_n)^T$ has mean $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_n)^T$ and covariance 

$$\mathbf{K} = \{K_{ij}\}_{n \times n},$$

then the scalar RV Y given by 

$$Y \triangleq p_1 X_1 + \ldots + p_n X_n$$

has mean 

$$E[Y] = \sum_{i=1}^{n} p_i \mu_i$$

and variance 

$$\sigma_Y^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} p_i p_j K_{ij}.$$

5.32 Compute the joint characteristic function of $\mathbf{X} = (X_1, \ldots, X_n)^T$, where the $X_i$, $i = 
1, \ldots, n$, are mutually independent and identically distributed Cauchy RVs, that is, 

$$f_{X_i}(x) = \frac{\alpha}{\pi(x^2 + \alpha^2)}.$$

Use this result to compute the pdf of $Y = \sum_{i=1}^{n} X_i$. 

5.33 Suppose that U and V are independent, zero-mean, unit variance Gaussian random 
variables. Let $X = U + V$, $Y = 2U + V$. 
Find the joint characteristic function of X and Y and find $E[XY]$. 

5.34 Let $\mathbf{X} = (X_1, \ldots, X_4)^T$ be a Gaussian random vector with $E[\mathbf{X}] = \mathbf{0}$. Show that 

$$E[X_1 X_2 X_3 X_4] = K_{12} K_{34} + K_{13} K_{24} + K_{14} K_{23},$$

where the $K_{ij}$ are elements of the covariance matrix $\mathbf{K} = \{K_{ij}\}_{4 \times 4}$ of $\mathbf{X}$. 

5.35 Let the joint pdf of $X_1, X_2, X_3$ be given by $f_{\mathbf{X}}(x_1, x_2, x_3) = \frac{2}{3}(x_1 + x_2 + x_3)$ over 
the region $S = \{(x_1, x_2, x_3) : 0 < x_i < 1, i = 1, 2, 3\}$ and zero elsewhere. Compute 
the covariance matrix and show that the random variables $X_1, X_2, X_3$, although not 
independent, are essentially uncorrelated. 

5.36 Let $X_1, X_2$ be jointly Normal, zero-mean random variables with covariance matrix 

$$\mathbf{K} = \begin{bmatrix} 2 & -1.5 \\ -1.5 & 2 \end{bmatrix}.$$

Find a whitening transformation for $\mathbf{X} = (X_1, X_2)^T$. Write a MATLAB program to 
show a scatter diagram, that is, $x_2$ versus $x_1$ where the latter are realizations of 
$X_2$, $X_1$, respectively. Do this for the whitened variables as well. Choose between a 
hundred and a thousand realizations. 

5.37 (linear transformations) Let $Y_k = \sum_{j=1}^{n} a_{kj} X_j$, $k = 1, \ldots, n$, where the $a_{kj}$ are real 
constants, the matrix $\mathbf{A} = [a_{ij}]_{n \times n}$ is nonsingular, and the $\{X_i\}$ are random variables. 
Let $\mathbf{B} = \mathbf{A}^{-1}$. Show that the pdf of $\mathbf{Y}$, $f_{\mathbf{Y}}(y_1, \ldots, y_n)$, is given by 

$$f_{\mathbf{Y}}(y_1, \ldots, y_n) = |\det \mathbf{B}| \, f_{\mathbf{X}}(x_1^*, \ldots, x_n^*), \quad \text{where } x_i^* = \sum_{k=1}^{n} b_{ik} y_k \text{ for } i = 1, \ldots, n.$$

5.38 (auxiliary variables) Let $Y_1 = \sum_{i=1}^{n} X_i$ and $Y_2 = \sum_{i=2}^{n} X_i$. Compute the 
joint pdf, $f_{Y_1Y_2}(y_1, y_2)$, by introducing the auxiliary variables $Y_k = \sum_{i=k}^{n} X_i$, 
$k = 3, \ldots, n$, and integrating over the range of each auxiliary RV. Show that 
$f_{\mathbf{Y}}(y_1, \ldots, y_n) = f_{\mathbf{X}}(y_1 - y_2, \ldots, y_{n-1} - y_n, y_n)$. (This problem and the previous 
are adapted from Example 4.9, p. 190, in Probability and Stochastic Processes for 
Engineers, C. W. Helstrom, Macmillan, 1984). 


Statistics: Part 1 
Parameter Estimation 


6.1 INTRODUCTION 


Statistics, which could equally be called applied probability, is a discipline that applies 
the principles of probability to actual data. Two key areas of statistics are parameter esti- 
mation and hypothesis testing. In parameter estimation, we use real-world data to esti- 
mate parameters such as the mean, standard deviation, variance, covariance, probabilities, 
and distributions. In hypothesis testing we use real-world data to make rational decisions, 
if possible, in a probabilistic environment. We leave the topic of hypothesis testing for 
Chapter 7. 

We recall that probability is a mathematical theory based on axioms and definitions 
and its main results are theorems, corollaries, relationships, and models. While probability 
enables us to model and solve a wide class of problems, the solutions to these problems 
often assume knowledge that is not readily available in the real world. For example, 
suppose we are given that $X: N(\mu, \sigma^2)$ and we wish to compute the probability of the event 
$E = \{-1 < X < +1\}$. We do this easily and obtain $F_{SN}((1 - \mu)/\sigma) - F_{SN}((-1 - \mu)/\sigma)$. 
However, in the real world how would we determine the parameters $\mu$, $\sigma$? For that matter, 
how would we even determine that this is a Gaussian problem? In earlier chapters we used 
important parameters such as $\mu_X$, the average or expected value of a random variable (RV) 
X; $\sigma_X$, the standard deviation of X; $\sigma_X^2$, the variance of X; $E[XY]$, the correlation of two 
RVs X and Y; and others. We estimate these quantities in the real world using so-called 
estimators, which are functions of RVs. What are the features of a good estimator? How 
do we choose among different estimators for the same parameter? What strong statements 
can we make regarding how “near” the estimate is to the true but unknown value? 
Much has been written about parameter estimation but the subject is not exhausted, as 
witnessed by the large number of research articles in the archival literature devoted to the 
subject. There are several excellent books (e.g., [6-1, 6-2]) on statistics and parameter 
estimation with an “engineering” flavor, and a plethora of expository material on the internet. 


Example 6.1-1 
(is a coin fair?) Suppose you are involved in a game involving a coin and you would like 
to know if the coin is fair. You flip the coin and observe a head (H). What conclusion can 
you come to? Other than concluding that the coin is not restricted from coming up heads 
there isn’t much else that you can conclude. Now you repeat the experiment and observe a 
tail (T) on the next toss. Can you conclude that the coin is fair? That would be a highly 
risky conclusion. Suppose that in 10 tosses you observe the sequence {H, T, T, H, H, T, H, 
T, H, T}. Based on the observations and using the frequency interpretation of probability, 
we might conclude that $P[H] = n_H/n = 5/10 = 0.5$ and thus that the coin is fair but you 
still cannot be certain. On the other hand, if you observe the sequence {H, T, H, H, H, 
H, T, H, H, H} you might be tempted to conclude that the coin is biased toward coming 
up heads but even here you can’t be certain. Is there a quantitative way of describing our 
uncertainty (or certainty)? In what follows we introduce some ideas that will help to answer 
this question. 











Independent, Identically Distributed (i.i.d.) Observations 


In the coin tossing experiment described above, upon tossing a coin we can define a generic 
RV X as 

$$X \triangleq \begin{cases} 1, & \text{if a head shows up,} \\ 0, & \text{if a tail shows up.} \end{cases}$$

If we toss the coin n times, we define a sequence of RVs $X_i$, $i = 1, \ldots, n$, which are called 
independent, identically distributed (i.i.d.) observations. The collection of these i.i.d. 
observations $\{X_i; i = 1, \ldots, n\}$ is called a random sample of size n from X. In some situations, X 
is more aptly called a population; but the set of observations on X is still called a random 
sample of size n. The $X_i$ in this example happen to be Bernoulli RVs but in general they 
could have any distributions as long as they all share the same CDF, pdf, or PMF and each 
observation is unaffected by the outcomes of the other observations. 

We have already introduced the idea of i.i.d. RVs in connection with our discussion 
of the Central Limit Theorem but elaborate on them some more here because of their 
extraordinary importance in statistics. The observations are independent because, in this 
case, subsequent tosses are not influenced in any way by the outcomes of previous tosses or 
future tosses. More precisely, in terms of the joint probability mass function (PMF) of the 
$X_i$, $i = 1, \ldots, n$: 

$$P_{X_1 X_2 \cdots X_n}(x_1, x_2, \ldots, x_n) = P_{X_1}(x_1) P_{X_2}(x_2) \cdots P_{X_n}(x_n).$$

They are identically distributed because we are using the same coin in all the tosses and the 
coin is assumed unaffected by the experiment. More precisely: 

$$P_{X_1}(x) = P_{X_2}(x) = \cdots = P_{X_n}(x) \triangleq P_X(x), \quad -\infty < x < \infty.$$










When we deal with continuous random variables the property i.i.d. implies: 

$$f_{X_1 X_2 \cdots X_n}(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1) f_{X_2}(x_2) \cdots f_{X_n}(x_n)$$

$$f_{X_1}(x) = f_{X_2}(x) = \cdots = f_{X_n}(x) \triangleq f_X(x), \quad -\infty < x < \infty.$$

The idea of i.i.d. observations is counterintuitive for many readers. For example, a coin, 
judged fair by all physical and previous statistical tests, is tossed and comes up heads nine 
times in a row; surely some readers will expect the coin to come up tails on the tenth toss 
to “balance things out.” But the coin has no memory of its past history and on the tenth 
toss it is as likely to come up heads as tails.† 
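
The "no memory" property follows directly from the product form of the joint PMF. The sketch below is an illustration only (the function names and the biased value 0.3 are hypothetical): it enumerates every outcome of ten tosses, sums the product-form joint PMF, and computes the conditional probability of a head on the tenth toss given nine heads in a row.

```python
from itertools import product

def joint_pmf(xs, p):
    """Product-form joint PMF of i.i.d. Bernoulli(p) observations (1 = head)."""
    prob = 1.0
    for x in xs:
        prob *= p if x == 1 else 1.0 - p
    return prob

def prob_head_after_run(run_length, p):
    """P[X_{m+1} = 1 | X_1 = ... = X_m = 1], obtained by summing the joint PMF."""
    m = run_length
    numer = denom = 0.0
    for xs in product((0, 1), repeat=m + 1):
        if all(x == 1 for x in xs[:m]):      # outcomes that start with m heads
            denom += joint_pmf(xs, p)
            if xs[m] == 1:                   # ... and also end with a head
                numer += joint_pmf(xs, p)
    return numer / denom

fair = prob_head_after_run(9, 0.5)      # fair coin
biased = prob_head_after_run(9, 0.3)    # hypothetical biased coin
print(f"fair coin: {fair:.3f}, biased coin: {biased:.3f}")
```

In both cases the conditional probability equals the single-toss probability of a head, so the run of nine heads carries no information about the tenth toss.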


Example 6.1-2 
(failure of identically distributed condition) We study the arrival rates of customers at a 
barbershop. To that end we partition the workday (7 am to 3 pm) into 16 half-hour intervals 
and count the number of arrivals in each interval. Let X;,i = 1,...,16, denote the number 
of arriving customers in the ith interval. Here the X; are not i.i.d. (failure of the “identically 
distributed” requirement). We expect more arrivals in the early morning, before people must 
report to their jobs, than at other times in the day except possibly during the lunch break. 





Example 6.1-3 
(biased random sampling) A breakfast food company that produces BranPellets™ cereal 
intends to show that eating BranPellets™ will result in weight loss. To that end the company 
hires a pollster to poll those who have attempted to lose weight by eating BranPellets™. 
The pollster begins by randomly selecting from the pool of BranPellets™ eaters but when 
the results do not seem to confirm that eating BranPellets™ results in weight loss, the 
pollster confines the polling to the sub-group of people of average or less-than-average 
weight. With $X_i$ denoting the weight loss of the ith person polled after three months of 
eating BranPellets™, we note that the set of $\{X_i\}$ obtained by fair polling is unlikely 
to be distributed by the same law as the set of $\{X_i\}$ obtained by biased polling. Incidentally, 
we could formulate this as a hypothesis testing problem by formulating the hypothesis 
that eating BranPellets™ will result in weight loss versus the alternative that eating 
BranPellets™ will not result in weight loss. 





Example 6.1-4 
(non-independent sequences) A conservative gambler plays n rounds of blackjack. He starts 
with a stash of \$100 and bets only \$1 at each round. Let $X_i$ denote the value of his stash at 
the ith play. Are the $X_i$, $i = 1, \ldots, n$, an independent sequence? Clearly $X_{i+1} = X_i \pm 1$; hence 
the $X_i$ are not mutually independent. For example, $P[X_i = 10, X_{i+1} = 12] = 0$, although 
taken separately neither probability needs to be zero. Let $Y_i$ denote the gambler’s win (or 
loss) on the ith play. Then $Y_i = \pm 1$. Are the $Y_i$, $i = 1, \ldots, n$, an independent sequence? The 
answer is yes,‡ because the outcome of the ith play has no memory of the past or future and 
therefore cannot be affected by it. 


†However, if in a large number of tosses there are many more heads than tails, the assumption that 
the coin is fair needs to be re-examined. Here hypothesis testing (Chapter 7) is useful in making a stronger 
statement than the coin is “probably fair” or “probably unfair.” 

‡Several assumptions are at play here, among them that the dealer plays fairly and that the gambler 
doesn’t change strategy as a result of his wins or losses. 

Example 6.1-5 
(review of joint versus sum probabilities) We make three i.i.d. observations on a zero/one 
Bernoulli RV X and call these $X_1, X_2, X_3$†. The PMF of X is $P_X(x) = p^x q^{1-x}$, $p + q = 1$, 
$x = 0, 1$. The joint PMF of the observations is 

$$P_{X_1 X_2 X_3}(x_1, x_2, x_3) = p^{x_1 + x_2 + x_3} \, q^{3 - (x_1 + x_2 + x_3)}.$$

Note that this is different from the PMF of the sum $Y \triangleq X_1 + X_2 + X_3$, which is binomial 
with PMF $P_Y(k) = b(k; 3, p) = \binom{3}{k} p^k q^{3-k}$. 
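
The distinction drawn in Example 6.1-5 can be made concrete by enumeration. The sketch below, with an arbitrary illustrative value of p, evaluates the joint PMF on all eight outcomes and then adds up the outcomes that share the same sum; the totals reproduce the binomial PMF. Variable names are hypothetical.

```python
from itertools import product
from math import comb

p = 0.3          # illustrative value of P[X = 1]; not taken from the text
q = 1.0 - p

def joint_pmf(x1, x2, x3):
    """Joint PMF of three i.i.d. zero/one Bernoulli observations."""
    s = x1 + x2 + x3
    return p**s * q ** (3 - s)

# PMF of the sum Y = X1 + X2 + X3, built by adding the joint PMF over
# all outcomes (x1, x2, x3) that give the same value of the sum.
pmf_sum = [0.0] * 4
for xs in product((0, 1), repeat=3):
    pmf_sum[sum(xs)] += joint_pmf(*xs)

binomial = [comb(3, k) * p**k * q ** (3 - k) for k in range(4)]

for k in range(4):
    print(f"k = {k}: summed joint PMF = {pmf_sum[k]:.4f}, b(k; 3, p) = {binomial[k]:.4f}")
```

The joint PMF assigns a probability to each ordered outcome, while the binomial coefficient counts how many ordered outcomes share a given sum.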





Estimation of Probabilities 


Suppose that, based on observations, we estimate that the probability§ of an event E is 
$\hat{P}[E] = n_E/n = 0.44$. Here n is the sample size and $n_E$ is the number of times the event E is 
observed. How close is 0.44 to the “true” probability of the event? The “true” probability of 
an event is often beyond our means to acquire. Suppose a medical researcher wants to know 
the proportion (probability × 100) of his patient’s red blood cells that are undersized. The true 
proportion could, hypothetically, be obtained by counting all the undersized cells among all 
red blood cells in the patient’s body and forming the ratio of the former to the latter. Of 
course this isn’t done. Nevertheless an excellent estimate can be obtained by counting the 
cells in a few drops of blood. As another example, suppose one of the states in the United 
States has a county with 343,065 registered voters and 144,087 have voted Republican. Then 
the true probability that a person in this county, picked at random, has voted Republican 
is 0.42. However, the cost of polling 343,065 voters may be prohibitive (or impossible in the 
time allowed) and pollsters may have to make predictions with much smaller random samples. 
Thus, suppose that pollsters do a random sampling of 512 voters and find that 225 voters 
have voted Republican. Then the estimated probability of Republican voters is 0.44. Notice 
that if the sample size is small enough, the estimate of Republican voters can be almost 
any number between zero and one. For example if we poll only two voters and they both 
voted Republican, our estimate of the probability of Republican voters would be one! But 
this estimate would be completely unreliable! On the other hand, if we could say something 
like “with a near-certain probability of 0.98 the estimated probability of a Republican voter 
is between 0.42 and 0.46” then we would have made a “hard” statement about the 
percentage of Republican voters. The probability 0.98 is a hard number because we can be 
nearly certain that the percentage of Republican voters is between 42 and 46 percent. Thus, 
the estimated probability of Republican voters is a “soft” number in the sense that it is, 
typically, quite uncertain and becomes more so as the sample size decreases. In real life we 
would much prefer to make categorical statements about the reliability of estimates than 
offer estimates of uncertain reliability. 


†Note that these $X_i$ are discrete random variables. 
§We mentioned in Chapter 1 that in many if not most practical problems, probabilities have to be 
estimated. 

One of the central goals of parameter estimation is to construct events that are (nearly) 
certain to occur, that is, events whose probability is a “hard” number. That is not to say 
that soft numbers such as the estimated probability $\hat{p}$ are necessarily unreliable or useless. 
For example, suppose that a rental-car dealership handling thousands of cars finds that at 
the end of year 1, $n_R$ of its $n_1$ cars must be replaced due to wear and tear. Then, all 
things being equal, if the agency starts year 2 with $n_2$ cars, it could reasonably expect that 
approximately $\hat{p} n_2$ cars will have to be replaced at the end of year 2, perhaps a few more, 
perhaps a few fewer. Here $\hat{p} \triangleq n_R/n_1$ is the estimated probability that a car will have to be 
replaced by the end of year 2. From the point of view of the executives of the company, the 
estimate $\hat{p} n_2$ is useful for year 2 planning and budgeting. 

In Example 6.1-6 below we demonstrate how firm or hard conclusions can be drawn by 
applying basic principles of statistics. 





Example 6.1-6 
(estimating the number of fish in a lake) To illustrate how statistics can be used to generate 
meaningful certain events, consider the following problem. The United States Fish and 
Wildlife Services (FWS), a bureau of the Department of the Interior, is interested in esti- 
mating the percentage of freshwater bass in a large lake that for specificity we call Bass 
Lake. To that end, an “experiment” is performed where a net is used to capture a random 
sample of fish, which is subsequently examined for its bass content. In preparation for this 
experiment, we will denote the number of bass in the sample by ng and the fixed sample 
size by n. Then we form the estimator p = ng/n, which is a random variable because ng 
is a random variable'. We do not consider n a random variable because we can decide a 
priori how big a sample will be examined for its bass content. The true probability p that 
a fish pulled at random from the lake is a bass is the ratio of total bass in the lake to total 
fish in the lake; this number is unknown (and mildly variable over time since fish have a 
tendency to eat each other). At the risk of adding additional notation, we must carefully 
distinguish between the random variable ng (a function) and its realization, which is a 
number. Realizations, whenever they don’t add to confusion, will be superscripted with a 
prime. For example, a realization of ng might yield n's = 58, n = 133 and the estimated t 
probability that a fish selected at random will be a bass is p’ = 58/133 = 0.44. The range 
of the function ng is the set of integers in the interval [0, n]. Of course, the realization 
p is only a one-time estimate of the true probability p that a fish will be a bass and we 
would like to make a stronger statement about the number of bass in Bass Lake. Suppose 
we examine the fish in the sample one-by-one. Let 


$$X_i \triangleq \begin{cases} 1, & \text{if the $i$th fish is a bass,} \\ 0, & \text{else.} \end{cases}$$

Then $X_i$ is a Bernoulli RV with PMF $P_{X_i}(x) = p^x(1-p)^{1-x}$ for $x = 0, 1$ and zero 
else. The random sample $\{X_i, i = 1,\ldots,n\}$ consists of $n$ i.i.d. observations on a generic 
random variable $X$, denoting whether a fish is a bass or not. We can think of $X$ as a 
population, that is, the fish population. 

†Here and in a few other places we briefly depart from our use of capital letters to denote random variables. 
‡The realization of an estimator is sometimes called an estimate, that is, a number. 

The RV $Z \triangleq \sum_{i=1}^{n} X_i$ represents the total number 
of bass in a sample of $n$ fish and $Z/n \triangleq \hat{p}$ is the estimator for $p$. Since $Z$ is the sum 
of independent Bernoulli RVs it is a binomial RV with PMF $b(k; n, p)$ (Example 4.8-1). 
Then $Z$ has mean $np$ and standard deviation $\sigma_Z = \sqrt{np(1-p)}$. We next create the 
(almost) certain event $E = \{np - 3\sqrt{np(1-p)} < Z < np + 3\sqrt{np(1-p)}\}$ and, since $Z$ 
is the sum of a large number of i.i.d. random variables, we can use the Normal approximation 
to compute $P[E]$ as allowed by the Central Limit Theorem. Indeed this was done in 
Example 4.8-3, where $P[E]$ was computed to be 0.997. We can rewrite $P[E]$, using $Z/n \triangleq \hat{p}$, 
as $P[E] = P[(\hat{p} - p)^2 < (9/n)\,p(1-p)] = 0.997$. We suggest that the reader verify this result. 
The argument is a quadratic in $p$, and solving for the roots $p_1, p_2$ of $(\hat{p} - p)^2 = (9/n)\,p(1-p)$ 
will give the end points of the interval of integration about $\hat{p}$ that will yield an event 
probability of 0.997. These points are 


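$$p_1, p_2 = \frac{2\hat{p} + (9/n)}{2[1 + (9/n)]} \mp \sqrt{\left(\frac{2\hat{p} + (9/n)}{2[1 + (9/n)]}\right)^2 - \frac{\hat{p}^2}{1 + (9/n)}}. \tag{6.1-1}$$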


For the numbers $n = 133$, $n_B' = 58$, we get $\hat{p}' = 0.44$ and find that $p_1' \approx 0.31$, $p_2' \approx 0.57$. 
How do we interpret these results? First note that there is no probability associated with 
the realized interval $[0.31, 0.57]$; it either contains the true probability $p$ or not; its length 
is 0.26 and $|\hat{p}' - p_1'| = |\hat{p}' - p_2'| = 0.13$. The number $100 \times |\hat{p}' - p_1'|$ is sometimes called the 
margin of error, which in this case is 13 percent†. The interval with end points $[p_1, p_2]$ is a 
random interval because its end points are random variables; that is, they depend on the 
estimator $\hat{p}$. However, on the average, the interval will enclose the point $p$ about 997 times in a 
thousand trials. We note that while the percentage of bass in the lake is nowhere near 
zero or 100 percent, we can be nearly certain, in this repeated-trials sense, that bass make up 
between 31 and 57 percent of the fish in Bass Lake! 

The above example illustrates how statistics has helped us to make a strong statement 
about the number of bass in Bass Lake. The statement might read like this: Research has 
shown that 44 percent of the fish in Bass Lake are bass. The margin of error is ± 13 percent. 
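The end points of Equation 6.1-1 are easy to check numerically. The following is a minimal sketch; the function name `proportion_interval` and the parameter `k` (the number of standard deviations, 3 in this example) are illustrative choices and not notation from the text.

```python
from math import sqrt


def proportion_interval(p_hat, n, k=3.0):
    """End points p1 <= p2 solving (p_hat - p)^2 = (k^2 / n) * p * (1 - p).

    With k = 3 this is the quadratic behind Equation 6.1-1.
    """
    c = k * k / n
    center = (2.0 * p_hat + c) / (2.0 * (1.0 + c))
    half = sqrt(center * center - p_hat * p_hat / (1.0 + c))
    return center - half, center + half


# Bass Lake numbers from the example: 58 bass in a sample of 133 fish.
p_hat = 58 / 133
p1, p2 = proportion_interval(p_hat, 133)
print(f"p_hat = {p_hat:.3f}, interval = [{p1:.3f}, {p2:.3f}]")
```

Carried to three decimals the end points come out near 0.315 and 0.565, in line with the rounded values 0.31 and 0.57 quoted in the example.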







Example 6.1-7 
(estimating dengue fever probability) A newspaper article reported that inhabitants and 
visitors on the island of Key West in the State of Florida were being exposed to the virus 
that causes dengue fever. The illness is caused by the bite of a mosquito that carries the 
virus in its gut. While some in the island’s tourist industry minimized the likelihood that 
a visitor would be infected with the virus, an independent study found that among 240 
residents, presumably picked at random, 13 tested positive for the dengue fever virus. Some 
argued that the sample was too small to be accurate and that the dengue fever rate was 
much lower. Compute a 95 percent confidence interval on the true probability that a resident 
picked at random will test positive for dengue fever. 


†It is not uncommon to describe the margin of error with an algebraic sign, for example in this case ±13%. 
‡The New York Times of July 23, 2010. 

Solution Our estimator for the true mean is $\hat{p} = K/n$ where $K$ is a Binomial random 
variable, i.e., 

$$P[k \text{ successes in } n \text{ tries}] \triangleq b(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k},$$

with $E[\hat{p}] = p$ and $\mathrm{Var}[\hat{p}] = p(1-p)/n$. From the data we compute the mean estimate as 
$\hat{p}' = 13/240 = 0.054$. Since $n \gg 1$, we use the Normal approximation to the Binomial and 
define the standard Normal random variable 

$$X \triangleq \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$$

such that $X: N(0,1)$. Then a 95 percent confidence interval on $p$ is found from solving 
$P[-z_{0.975} < X < z_{0.975}] = 2F_{SN}(z_{0.975}) - 1 = 0.95$, or $z_{0.975} = 1.96$. Then 

$$P\left[-1.96 < \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} < 1.96\right] = 0.95.$$

Using the technique in Example 6.1-6, we find that the lower and upper limits of the 95 
percent interval on $p$, in this case, are the roots of the polynomial $1.016p^2 - 0.124p + 0.003 = 0$, 
which are $p_l = 0.033$, $p_u = 0.089$. Thus we have a 95 percent confidence that the infection 
rate is from a low of 1 in 30 residents to a high of 1 in 11. Would this knowledge affect your 
plans to visit Key West? 
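As a check on the polynomial quoted in the solution, the steps can be written out. The symbol $c$ is introduced here only for brevity and is not notation from the text.

```latex
\begin{aligned}
(\hat{p}' - p)^2 &= c\,p(1-p), \qquad c \triangleq \frac{1.96^2}{n} = \frac{3.8416}{240} \approx 0.016, \\
(1 + c)\,p^2 - (2\hat{p}' + c)\,p + \hat{p}'^2 &= 0, \\
1 + c \approx 1.016, \qquad 2\hat{p}' + c &\approx 0.108 + 0.016 = 0.124, \qquad \hat{p}'^2 \approx 0.0029 \approx 0.003 .
\end{aligned}
```

Because the quoted coefficients are rounded, solving with unrounded coefficients gives end points of roughly 0.032 and 0.090, which differ from the quoted 0.033 and 0.089 only in the third decimal place.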


6.2 ESTIMATORS 


Estimators are functions of RVs that are used to estimate parameters but do not depend 
on the parameters themselves. We illustrate with some examples. 


Example 6.2-1 
(truth in packaging) A consumer protection agency (CPA) seeks to verify the information 
on the label of packages of cooked turkey breasts sold in supermarkets that says “70% meat, 
30% water.” The turkey breasts are produced by “Sundry Farms” and the CPA buys five 
“Sundry Farms” packages and checks for meat content percentage (mcp). With $X_i$ denoting 
the mcp of the $i$th package, the CPA uses the function $\hat{\Theta}_1 = (1/n)\sum_{i=1}^{n} X_i$ to estimate the 
average mcp. It finds the following mcp's in the five packages ($n = 5$) respectively: 68, 82, 
71, 65, 67 and obtains an average of 70.6 percent meat. 

The 70.6 percent represents a realization of the estimator $\hat{\Theta}_1$ and is often called an 
estimate. If the CPA buys another set of five packages of cooked turkey breasts from “Sundry 
Farms,” it would no doubt compute a slightly different estimate from the previous one. 


Example 6.2-2 
(truth in packaging continued) The CPA seeks to estimate the variability in 
the meat content of “Sundry Farms” turkey breasts. It uses the formula 

$$\hat{\Theta}_2 = \left(\frac{1}{n}\sum_{i=1}^{n}\Big(X_i - \frac{1}{n}\sum_{j=1}^{n} X_j\Big)^2\right)^{1/2}$$

with $n = 5$ and obtains approximately 6.0 percent 
meat variability using the data in the previous problem. 

Example 6.2-3 
(truth in packaging continued) The CPA is criticized for using $\hat{\Theta}_2$ in the previous problem as 
a measure of variability. It is suggested that the CPA use instead the estimator 

$$\hat{\Theta}_3 = \left(\frac{1}{n-1}\sum_{i=1}^{n}\Big(X_i - \frac{1}{n}\sum_{j=1}^{n} X_j\Big)^2\right)^{1/2}.$$

Using $\hat{\Theta}_3$ with $n = 5$, the CPA 
computes a meat variability of 6.7 percent. 


In what follows we shall find that $\hat{\Theta}_1$ is an unbiased and consistent estimator for the mean, 
$\hat{\Theta}_2$ is a biased, maximum likelihood estimator for the standard deviation, and $\hat{\Theta}_3$ is a 
consistent estimator for the standard deviation whose square is an unbiased estimator for the 
variance. Other estimators are used to estimate 
$\mathrm{Var}[X]$, the covariance matrix $\mathbf{K}$ and so on for the higher joint moments. 
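The three estimates in Examples 6.2-1 to 6.2-3 can be reproduced with a few lines of code. This is a minimal sketch; the function names `theta1`, `theta2`, and `theta3` are illustrative labels for the three estimators and are not from the text.

```python
import math

# Meat content percentages (mcp) of the five packages in Example 6.2-1.
mcp = [68, 82, 71, 65, 67]


def theta1(xs):
    """Sample mean: (1/n) * sum of the observations."""
    return sum(xs) / len(xs)


def theta2(xs):
    """Root of the mean squared deviation, dividing by n."""
    m = theta1(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def theta3(xs):
    """Root of the mean squared deviation, dividing by n - 1."""
    m = theta1(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


print(f"theta1 = {theta1(mcp):.1f}")  # average mcp
print(f"theta2 = {theta2(mcp):.2f}")  # variability with the 1/n form
print(f"theta3 = {theta3(mcp):.2f}")  # variability with the 1/(n-1) form
```

The only difference between the last two functions is the divisor, which is why the second estimate is always the larger of the two for the same data.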

Some estimators have more desirable properties than others do. To evaluate estimators 
we introduce the following definitions. 


Definition 6.2-1 An estimator† $\hat{\Theta}$ is a function of the observation vector $\mathbf{X} = 
(X_1,\ldots,X_n)^T$ that estimates $\theta$ but is not a function of $\theta$. ■ 

Definition 6.2-2 An estimator $\hat{\Theta}$ for $\theta$ is said to be unbiased if and only if $E[\hat{\Theta}] = \theta$. 
The bias in estimating $\theta$ with $\hat{\Theta}$ is‡ 

$$|E[\hat{\Theta}] - \theta|.$$ ■ 

Definition 6.2-3 An estimator $\hat{\Theta}$ is said to be a linear estimator of $\theta$ if it is a linear 
function of the observation vector $\mathbf{X} \triangleq (X_1,\ldots,X_n)^T$, that is, 

$$\hat{\Theta} = \mathbf{b}^T\mathbf{X}. \tag{6.2-1}$$

The vector $\mathbf{b}$ is an $n \times 1$ vector of coefficients that do not depend on $\mathbf{X}$. ■ 

Definition 6.2-4 Let $\hat{\Theta}_n$ be an estimator computed from $n$ samples $X_1,\ldots,X_n$ for 
every $n \ge 1$. Then $\hat{\Theta}_n$ is said to be consistent if 

$$\lim_{n\to\infty} P[|\hat{\Theta}_n - \theta| > \varepsilon] = 0 \quad \text{for every } \varepsilon > 0. \tag{6.2-2}$$

The condition in Equation 6.2-2 is often referred to as convergence in probability. ■ 

Definition 6.2-5 An estimator $\hat{\Theta}$ is called minimum-variance unbiased if 

$$E[(\hat{\Theta} - \theta)^2] \le E[(\hat{\Theta}' - \theta)^2], \tag{6.2-3}$$

where $\hat{\Theta}'$ is any other estimator and $E[\hat{\Theta}'] = E[\hat{\Theta}] = \theta$. ■ 

Definition 6.2-6 An estimator $\hat{\Theta}$ is called a minimum mean-square error (MMSE) 
estimator if 

$$E[(\hat{\Theta} - \theta)^2] \le E[(\hat{\Theta}' - \theta)^2], \tag{6.2-4}$$

where $\hat{\Theta}'$ is any other estimator. ■ 
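The criteria in Definitions 6.2-5 and 6.2-6 look identical apart from the unbiasedness constraint. A standard decomposition of the mean-square error, given here as a supplementary remark rather than as part of the definitions, shows how they are related:

```latex
\begin{aligned}
E[(\hat{\Theta} - \theta)^2]
  &= E\big[(\hat{\Theta} - E[\hat{\Theta}])^2\big] + \big(E[\hat{\Theta}] - \theta\big)^2 \\
  &= \mathrm{Var}[\hat{\Theta}] + (\text{bias})^2 .
\end{aligned}
```

The cross term vanishes because $E[\hat{\Theta} - E[\hat{\Theta}]] = 0$. For an unbiased estimator the second term is zero, so the mean-square error equals the variance, which is why Equation 6.2-3 amounts to a comparison of variances.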
†The validity of estimating parameters as well as other objects, for instance probabilities, from repeated 
observations is based, fundamentally, on the law of large numbers and the Chebyshev inequality. 
‡The bias is often defined without the magnitude sign. In that case the bias could be positive or negative. 

There are several other properties of estimators that are deemed desirable such as efficiency, 
completeness, and invariance. These properties are discussed in books on statistics† 
and will not be discussed further here. 


6.3 ESTIMATION OF THE MEAN 


In Chapter 4 we showed that the numerical average, $\mu_s$, of a set of numbers is the number 
that is simultaneously closest to all the numbers $x_1, x_2, \ldots, x_n$ in a set. In this sense $\mu_s$ can 
be regarded as the best representative of the set. Borrowing from mechanics, some think of 
the average as the center of gravity of the set. While the sample average doesn’t tell the 
whole story, it is a useful descriptor for assessment in all sorts of situations. For example, if 
the average grade on a standardized test earned by students in School A is 92 and the average 
grade on the same test is 71 for students at School B, then, all other things being equal, 
one might conclude that School A does a better job of preparing its students than School 
B. If a large amount of data, suitably corrected for other factors (e.g., sex, income, race, 
lifestyle), showed that the average lifetime of smokers is 67 years while that of nonsmokers 
is 78 years, one could reasonably conclude that smoking is bad for your health. 
Repeating Equation 4.1-1 here with a slight change of notation, 

$$\mu_s(n) = \frac{1}{n}\sum_{i=1}^{n} x_i, \tag{6.3-1}$$

we observe that the numerical average depends on the sample size $n$ 
as well as on the samples themselves. In our model we assume that the data are realizations of 
$n$ i.i.d. observations on the generic random variable $X$; that is, $x_1$ is a one-time realization 
of the observation $X_1$, $x_2$ is a one-time realization of the observation $X_2$, and so forth. Each 
of the $X_i$ is a function while $x_i$ is a numerical value that the function takes. We create 
the mean-estimator function 

$$\hat{\mu}_X(n) = \frac{1}{n}\sum_{i=1}^{n} X_i \tag{6.3-2}$$

from the random sample $\{X_1,\ldots,X_n\}$ to estimate the unknown parameter $\mu_X \triangleq E[X]$. 
We recognize that $\hat{\mu}_X(n)$ is the estimator $\hat{\Theta}_1$ introduced in Section 6.2. The object in 
Equation 6.3-2 is often called the sample mean. We use the hat to indicate that $\hat{\mu}_X(n)$ is 
an estimator and not the actual mean. Incidentally, it is useful to introduce at this point 
the variance-estimator function (VEF) or the sample variance as 

$$\hat{\sigma}_X^2(n) \triangleq \frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2. \tag{6.3-3}$$
t=1 


n-li — 


We recognize that this VEF is the square of the estimator $\hat{\Theta}_3$ in Section 6.2. This is one of two 
VEFs that are in common use. The other one is 

$$\hat{\sigma}_X^2(n) = \frac{1}{n}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2. \tag{6.3-4}$$

We shall discuss the estimation of the variance in a later section but for now, we ask the 
reader's indulgence to take Equation 6.3-3 at face value. The estimation of $\sigma_X^2$ by the VEF 
in Equation 6.3-3 is, as we shall see later, an entirely reasonable thing to do. Among other 
attractive features we find that $E[\hat{\sigma}_X^2(n)] = \sigma_X^2$, which is only asymptotically true for the VEF 
of Equation 6.3-4. 

†See, for example, [6-1, Chapter 8]. 


Properties of the Mean-Estimator Function (MEF) 


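The mean estimator given by Equation 6.3-2 is unbiased, meaning that $E[\hat{\mu}_X(n) - \mu_X] = 0$. 
The proof of this important result is easy. We write 

$$E[\hat{\mu}_X(n)] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{\mu_X}{n}\sum_{i=1}^{n} 1 = \mu_X\, n \cdot \frac{1}{n} = \mu_X. \tag{6.3-5}$$

An unbiased estimator is often, but not always, desired†. Another important property 
of an estimator is that, in some sense, it gets better as we make more observations. For 
example, we would expect the MEF in Equation 6.3-2 to be more “reliable” if it is based on 
100 rather than on 10 observations. One way to measure reliability is by way of the variance 
of the unbiased estimator. If the variance of the unbiased estimator is small, it is unlikely 
that a realization of $\hat{\mu}_X(n)$ will be very far from the true mean $\mu_X$; if the variance is large, 
the realization might often be far from the true mean. Consider the variance of $\hat{\mu}_X(n)$. By 
definition this is 

$$\begin{aligned}
\sigma_{\hat{\mu}}^2(n) &\triangleq E\big[(\hat{\mu}_X(n) - E[\hat{\mu}_X(n)])^2\big] = E\big[(\hat{\mu}_X(n) - \mu_X)^2\big] \\
&= E\left[\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_X)\right)^2\right] \\
&= \frac{1}{n^2} E\left[\sum_{i=1}^{n}(X_i - \mu_X)^2 + \sum_{i=1}^{n}\sum_{j \ne i}(X_i - \mu_X)(X_j - \mu_X)\right] \\
&= \frac{1}{n^2}\sum_{i=1}^{n} E\big[(X_i - \mu_X)^2\big] + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j \ne i} E\big[(X_i - \mu_X)(X_j - \mu_X)\big] \\
&= \sigma_X^2/n.
\end{aligned} \tag{6.3-6}$$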


In line 1, the term on the right uses the unbiasedness of the MEF. Line 2 uses the definition 
of the MEF and multiplies and divides $\mu_X$ by $n$. Line 3 uses that the square of a sum is 
the sum of squares plus the sum of cross-term products with nonequal indexes. Line 4 uses 
the linearity of the expectation operator, and line 5 takes advantage of the fact that for 
$i \ne j$, $X_i$ and $X_j$ are independent and therefore $E[(X_i - \mu_X)(X_j - \mu_X)] = 0$. At this point 
we invoke the Chebyshev inequality (see Section 4.4) and apply it to $\hat{\mu}_X(n)$. Then for any 
$\delta > 0$ 

$$P[|\hat{\mu}_X(n) - \mu_X| > \delta] \le \frac{\sigma_{\hat{\mu}}^2(n)}{\delta^2} = \frac{\sigma_X^2}{n\delta^2} \xrightarrow[n\to\infty]{} 0. \tag{6.3-7}$$

†One may tolerate a small amount of bias if the estimator has other desirable properties. 


Equations 6.3-6 and 6.3-7 are among the most important results in all of statistics. Equation 
6.3-6 says that the variance of the mean estimator decreases with increasing n and hence can 
be made arbitrarily small by choosing a large enough sample size. Specifically, the variance 
of the mean estimator is numerically equal to the variance of the observation variable 
divided by the sample size. This is true so long as the observation variable X has finite 
variance. Equation 6.3-7 says that the event that the absolute deviation between the true 
mean and the MEF exceeds a certain value—no matter how small that value is—becomes 
highly improbable when the sample size is made large enough. An estimator that obeys 
Equations 6.3-6 and 6.3-7 is said to be consistent. 
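A small simulation can make Equations 6.3-6 and 6.3-7 concrete. The sketch below assumes Normal observations with mean zero, and the sample sizes, trial count, seed, and function names are arbitrary illustrative choices rather than anything prescribed by the text.

```python
import random
import statistics


def simulate_sample_means(n, sigma, mu=0.0, trials=5000, seed=1):
    """Return `trials` realizations of the sample mean of n i.i.d. N(mu, sigma^2) draws."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        for _ in range(trials)
    ]


def exceed_fraction(means, mu, delta):
    """Fraction of realizations with |mean - mu| > delta."""
    return sum(abs(m - mu) > delta for m in means) / len(means)


sigma, delta = 3.0, 1.0
for n in (10, 40, 160):
    means = simulate_sample_means(n, sigma)
    empirical_var = statistics.pvariance(means)      # compare with sigma^2 / n
    chebyshev_bound = sigma ** 2 / (n * delta ** 2)  # right side of Equation 6.3-7
    print(
        f"n={n:4d}  var(mean)={empirical_var:.3f}  sigma^2/n={sigma ** 2 / n:.3f}  "
        f"P[|err|>{delta}]={exceed_fraction(means, 0.0, delta):.3f}  "
        f"bound={min(chebyshev_bound, 1.0):.3f}"
    )
```

The empirical variance should track $\sigma_X^2/n$ closely, while the observed exceedance fraction typically sits well below the Chebyshev bound, which is a guarantee and not an approximation.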


Example 6.3-1 
(effect of sample size on estimating the mean) We wish to compute $P[|\hat{\mu}_X(n) - \mu_X| \le 0.1]$ 
when $X$ is Normal with $\sigma_X = 3$. To illustrate the effect of sample size we use two random 
samples: a small sample ($n = 64$) and a large sample ($n = 3600$). We write 

$$\begin{aligned}
P[-0.1 \le \hat{\mu}_X(n) - \mu_X \le 0.1] &= P[-0.1\sqrt{n}/\sigma_X \le Y \le 0.1\sqrt{n}/\sigma_X] \\
&= 2\,\mathrm{erf}\!\left(\frac{0.1\sqrt{n}}{\sigma_X}\right) \\
&= 2\,\mathrm{erf}(0.0333\sqrt{n}),
\end{aligned}$$

where $Y \triangleq (\hat{\mu}_X(n) - \mu_X)/(\sigma_X/\sqrt{n})$ is distributed as $N(0,1)$. When $n = 64$, 
$P[|\hat{\mu}_X(n) - \mu_X| \le 0.1] \approx 0.2$. We can interpret this result as saying that in a thousand trials involving sample 
sizes of 64, in only about 200 outcomes will the mean estimate deviate from the true mean 
by 0.1 or less. For $n = 3600$, we compute $P[|\hat{\mu}_X(n) - \mu_X| \le 0.1] \approx 0.95$, which implies that 
the event $\{|\hat{\mu}_X(n) - \mu_X| \le 0.1\}$ will occur in about 950 out of 1000 trials. The implication 
is that in a single trial, the event $\{|\hat{\mu}_X(n) - \mu_X| \le 0.1\}$ is very likely to happen when 
$n = 3600$. 
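The two probabilities in this example can be checked with the standard Normal CDF. The quoted values are consistent with erf(x) denoting the area under the standard Normal pdf between 0 and x, which is not the convention used by `math.erf` in Python, so the sketch below works with the CDF directly. The function name `prob_mean_within` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_within(tol, sigma, n):
    """P[|sample mean - true mean| <= tol] for n i.i.d. Normal observations with std sigma."""
    z = tol * sqrt(n) / sigma            # tolerance in units of the std of the sample mean
    return 2.0 * NormalDist().cdf(z) - 1.0


for n in (64, 3600):
    print(f"n = {n:4d}: P = {prob_mean_within(0.1, 3.0, n):.3f}")
```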


Example 6.3-2 
(how many samples do we need to get a 95 percent confidence interval on the mean?) We 
want to compute a 95 percent confidence interval on the mean of a Normal random variable 
$X$. How many observations $X_1,\ldots,X_n$ on $X$ do we need? More to the point, what parameters 
determine the length and location of the interval? The terminology “95 percent 
confidence interval” merely means that we seek the end points of the shortest (or near-shortest) 
interval on the real line such that we expect that in 950 or so cases out of 1000 
the interval will enclose the true mean. In terms of a probability we write 


$$P[|\hat{\mu}_X(n) - \mu_X| \le \gamma_{0.95}] = 0.95, \tag{6.3-8}$$

where $\gamma_{0.95}$ is a number to be determined and its subscript reminds us that it 
is a 95 percent confidence interval we seek. We recall that $\hat{\mu}_X(n)$ is $N(\mu_X, \sigma_X^2/n)$ so that 

$$Y \triangleq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} \tag{6.3-9}$$

is $N(0,1)$. Then, rewriting Equation 6.3-8 with $Y$ in mind, we obtain 

$$\begin{aligned}
0.95 &= P[-\gamma_{0.95} \le \hat{\mu}_X(n) - \mu_X \le \gamma_{0.95}] \\
&= P[-\gamma_{0.95}\sqrt{n}/\sigma_X \le Y \le \gamma_{0.95}\sqrt{n}/\sigma_X] \\
&= 2F_{SN}(\gamma_{0.95}\sqrt{n}/\sigma_X) - 1.
\end{aligned} \tag{6.3-10}$$

In line 2 we converted the RV $\hat{\mu}_X(n) - \mu_X$ into an $N(0,1)$ random variable $Y$. In line 
3 we expressed this probability in terms of the standard Normal CDF. The last line of 
Equation 6.3-10 yields the result we seek, that is, $F_{SN}(\gamma_{0.95}\sqrt{n}/\sigma_X) = 0.975$. As on other 
occasions we use the symbol $F_{SN}$ to denote the standard Normal (SN) CDF and write $F_{SN}(z_u) = u$. 
The number $z_u$ is called the $u$-percentile of the standard Normal. From the tables of the 
CDF (see Appendix G) we find that $z_{0.975} = 1.96$. But since $z_{0.975} = \gamma_{0.95}\sqrt{n}/\sigma_X$, we 
deduce that $\gamma_{0.95}\sqrt{n}/\sigma_X = 1.96$ or, equivalently, $\gamma_{0.95} = 1.96\,\sigma_X/\sqrt{n}$. Returning to the 
problem at hand, we note that the event $\{|\hat{\mu}_X(n) - \mu_X| \le \gamma_{0.95}\}$ is the same as the event 
$\{\hat{\mu}_X(n) - \gamma_{0.95} \le \mu_X \le \hat{\mu}_X(n) + \gamma_{0.95}\}$. Then, from the middle line of Equation 6.3-10 we 
get (on the average) a shortest 95 percent confidence interval for $\mu_X$ as 

$$\left[-1.96\frac{\sigma_X}{\sqrt{n}} + \hat{\mu}_X(n),\; 1.96\frac{\sigma_X}{\sqrt{n}} + \hat{\mu}_X(n)\right]. \tag{6.3-11}$$


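Of course this result can be generalized to other than 95 percent confidence intervals. 
Suppose we seek a $\delta$-confidence interval (here we specified $\delta = 0.95$). Then a $\delta$-confidence 
interval on $\mu_X$ is 

$$\left[-z_{(1+\delta)/2}\frac{\sigma_X}{\sqrt{n}} + \hat{\mu}_X(n),\; z_{(1+\delta)/2}\frac{\sigma_X}{\sqrt{n}} + \hat{\mu}_X(n)\right]. \tag{6.3-12}$$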


How do we know that, on the average, it is the shortest interval? Because of the symmetry 
of the Normal pdf, the largest amount of probability mass is at the center. Any other 95 
percent interval will require more support, that is, need a longer length. 

Let us return to what was asked for. The question as to how many samples are needed 
for a shortest 95 percent confidence interval cannot be answered if $\sigma_X$ is not known. 
Clearly, by choosing a large enough interval, for example, a ten-sigma width on either 
side of $\hat{\mu}_X(n)$, we shall get a 95 percent (and more!) confidence even when the number of 
samples, $n$, is small. But with a ten-sigma width on either side the interval will not be the 
shortest and will prove useless because it is too large. So let us assume that it is the shortest 
interval that we seek. Then the interval will be centered about $\hat{\mu}_X(n)$ and have width 
$W_{0.95} = 2 \times 1.96\,\sigma_X/\sqrt{n}$. So clearly, the ratio $\sigma_X/\sqrt{n}$ determines the width of the interval. 
If $\sigma_X$ is known (this is unlikely in practice), then we can determine how many samples 
we need to obtain a confidence interval of a specified length. For an arbitrary $\delta$-confidence 
interval, the width of the confidence interval is 

$$W_\delta = 2 \times z_{(1+\delta)/2} \times \sigma_X/\sqrt{n}. \tag{6.3-13}$$


Not surprisingly we find from Equation 6.3-13 that the interval gets wider (which increases 
our uncertainty as to the true mean) when the standard deviation of X increases but gets 
smaller (which decreases our uncertainty as to the true mean) when the number of samples 
increases. Also the interval gets wider when the demanded percent confidence increases. 
Does this make sense? 
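Equation 6.3-13 can be inverted to answer the question posed in the example: for a known $\sigma_X$ and a target width, how many samples are needed? The sketch below does this with the standard library; the function name `samples_needed` and the numerical inputs are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def samples_needed(width, sigma, delta=0.95):
    """Smallest n with 2 * z_{(1+delta)/2} * sigma / sqrt(n) <= width (Equation 6.3-13)."""
    z = NormalDist().inv_cdf((1.0 + delta) / 2.0)   # percentile z_{(1+delta)/2}
    return ceil((2.0 * z * sigma / width) ** 2)


# Illustrative numbers: sigma_X = 3 and a total width of 0.2 (that is, +/- 0.1).
for delta in (0.90, 0.95, 0.99):
    print(f"delta = {delta:.2f}: n >= {samples_needed(0.2, 3.0, delta)}")
```

Because the width scales as $1/\sqrt{n}$, halving the target width roughly quadruples the required sample size, and demanding a higher confidence level raises it further.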


Procedure for Getting a $\delta$-confidence Interval on the Mean of a Normal 
Random Variable When $\sigma_X$ Is Known 


(1) Choose a value of $\delta$ and compute $(1 + \delta)/2$; 

(2) From the tables of the CDF for the standard Normal find the percentile $z_{(1+\delta)/2}$ 
such that $F_{SN}(z_{(1+\delta)/2}) = (1 + \delta)/2$; 

(3) Obtain the realizations of $X_i$, $i = 1,\ldots,n$. Label these numbers $x_i$, $i = 1,\ldots,n$. 
Compute the numerical average $\mu_s = \frac{1}{n}\sum_{i=1}^{n} x_i$; 

(4) Compute the interval $\left[-z_{(1+\delta)/2}\,\sigma_X/\sqrt{n} + \mu_s,\; z_{(1+\delta)/2}\,\sigma_X/\sqrt{n} + \mu_s\right]$. 
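The four steps above can be sketched in code as follows. The percentile lookup uses `statistics.NormalDist` in place of a printed table; the data values and the assumed known $\sigma_X = 0.3$ are hypothetical and serve only to exercise the function.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci_known_sigma(xs, sigma, delta=0.95):
    """Steps (1)-(4): delta-confidence interval on the mean when sigma is known."""
    u = (1.0 + delta) / 2.0                 # step (1)
    z = NormalDist().inv_cdf(u)             # step (2): F_SN(z) = (1 + delta) / 2
    n = len(xs)
    mu_s = sum(xs) / n                      # step (3): numerical average
    half = z * sigma / sqrt(n)              # step (4): half-width of the interval
    return mu_s - half, mu_s + half


# Hypothetical realizations and an assumed known sigma (not taken from the text).
xs = [9.8, 10.4, 10.1, 9.6, 10.3, 10.0, 9.9, 10.2]
low, high = mean_ci_known_sigma(xs, sigma=0.3, delta=0.95)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```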


Up until now, we have assumed that $\sigma_X$ is known. However, $\sigma_X$ is typically not known. (Can 
you think of a situation where we do not know $\mu_X$ but know $\sigma_X$?) One possible solution to 
this problem is to replace $\sigma_X$ in Equation 6.3-11 by an estimated value of it, for example, 
$\hat{\sigma}_X(n)$, the square root of Equation 6.3-3, and continue with our assumption that $Y$ is 
Normal. But in fact $Y$ would not be Normal because of the randomness in $\hat{\sigma}_X(n)$ and this 
might not yield accurate results, especially when the sample size is not large. Not knowing 
$\sigma_X$ requires that we seek another approach for determining a prescribed confidence interval. 
Such an approach is furnished by the t-distribution discussed below. 


Confidence Interval for the Mean of a Normal Distribution When $\sigma_X$ 
Is Not Known 


In general, the distributions one encounters in statistics are often of an algebraic form that 
is more complex than those we encounter in elementary probability. One of these is the 
so-called “Student's” t-distribution introduced by W. S. Gosset in connection with his 
work of computing a confidence interval for the mean of a Normal distribution when the 
variance is not known. Gosset is considered one of the founders of modern statistics but is 
better known by his pen name Student†. As we saw in our previous discussion, the problem 
of finding the end points of a confidence interval involves the distribution of the $N(0,1)$ RV 

$$Y \triangleq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}}.$$

†1876-1937. Much secrecy enveloped his work on statistical quality control at the Dublin brewery of 
Arthur Guinness & Son. For this reason he used the pseudonym “Student.” 


However, without knowledge of $\sigma_X$ we cannot find the end points that define the confidence 
interval. So we create a new RV by replacing $\sigma_X$ by 

$$\hat{\sigma}_X(n) = \left(\frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2\right)^{1/2},$$

which is merely the sigma value derived from the VEF of Equation 6.3-3. This new RV is 
defined by 

$$T_{n-1} \triangleq \frac{\hat{\mu}_X(n) - \mu_X}{\left(\frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2\right)^{1/2}\Big/\sqrt{n}} = \frac{\hat{\mu}_X(n) - \mu_X}{\hat{\sigma}_X(n)/\sqrt{n}} \tag{6.3-14}$$

and is said to have a t-distribution with $n - 1$ degrees of freedom for $n = 2, 3, \ldots$ We do 
not treat $T_{n-1}$ as an approximation to a standard Normal RV. As $n$ changes, we generate 
a family of t-distributions. We denote the pdf associated with $T_{n-1}$ by $f_T(x; n - 1)$. The 
important thing to observe is that $T_{n-1}$ does not involve the unknown $\sigma_X$, a fact that 
enables us to compute confidence intervals on the mean $\mu_X$, something we could not do 
using the RV in Equation 6.3-9. 

It is important for the reader to understand that in creating the t-distribution we did 
not approximate $\sigma_X$ by $\hat{\sigma}_X$. The brilliance of the contribution of Gosset was in avoiding 
approximations required to use the Normal distribution and working instead with $T_{n-1}$ and 
its distribution. 

For insight, we can rewrite Equation 6.3-14 as 

$$T_{n-1} = \frac{(\hat{\mu}_X(n) - \mu_X)\sqrt{n}/\sigma_X}{\left(\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i - \hat{\mu}_X(n)}{\sigma_X}\right)^2\right)^{1/2}} = \frac{Y}{\big(Z_{n-1}/(n-1)\big)^{1/2}}, \tag{6.3-15}$$

where $Y \triangleq (\hat{\mu}_X(n) - \mu_X)\sqrt{n}/\sigma_X : N(0,1)$ and $Z_{n-1} \triangleq \sum_{i=1}^{n}\left(\frac{X_i - \hat{\mu}_X(n)}{\sigma_X}\right)^2$ has a $\chi^2_{n-1}$ pdf 
with $n - 1$ degrees of freedom. When spelled out the symbol $\chi^2$ is written Chi-square 
(pronounced ky-square as in sky-square). The subscript of the Chi-square RV gives the 
number of degrees of freedom (DOF) and the RV range is $(0, \infty)$. This implies that the 
CDF $F_{\chi^2}(x; n) = 0$ for $x \le 0$ for every integer $n \ge 1$. The $\chi^2$ distribution was introduced 
in Chapter 2 and is sometimes called a sampling distribution because it involves 
i.i.d. samples of a population $X$. It is not obvious but $Y$ and $Z_{n-1}$, although sharing the 
same $X_i$, $i = 1,\ldots,n$, can be shown to be statistically independent (see Appendix G). From 
Equation 6.3-15 we see that the t-random variable is the ratio of a standard Normal 
RV (numerator) to the square root of a Chi-square RV divided by its DOF. 
For large values of $n$ the t-distribution will not be that different from the Normal (see 
Figure 6.3-1). Indeed the pdf of $T_{n-1}$ is centered at the origin and symmetrical about it. 

[Figure 6.3-1: t-pdf versus Normal pdf. Curves shown: $f_T(x; 3)$, $f_T(x; 13)$, and the standard Normal pdf $f_{SN}(x)$.] 

Figure 6.3-1 The probability density function of the T random variable has a shape similar to that of 
the Normal pdf, especially as the number of degrees of freedom gets larger. Here is shown the t-pdf for 
n = 3 (peaks at 0.36); n = 13 (the curve with the boxes that peaks at 0.39); and the SN pdf. Except for 
a barely observed variation in the tails, the n = 13 t-distribution is virtually identical with the Normal. 

In seeking the shortest confidence interval for $\mu_X$, we consider the event $\{-t_{(1+\delta)/2} \le T_{n-1} \le t_{(1+\delta)/2}\}$. 
The probability of this event is 

$$P[-t_{(1+\delta)/2} \le T_{n-1} \le t_{(1+\delta)/2}] = \delta, \tag{6.3-16}$$


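where, as before, $100 \times \delta$ is the assigned percent confidence for the interval on $\mu_X$. With the 
CDF for the $T_{n-1}$ RV denoted by $F_T(t; n - 1) \triangleq \int_{-\infty}^{t} f_T(x; n - 1)\,dx$, we find that 

$$\delta = 2F_T(t_{(1+\delta)/2}; n - 1) - 1$$

or, equivalently, 

$$F_T(t_{(1+\delta)/2}; n - 1) = \frac{1 + \delta}{2}. \tag{6.3-17}$$

From the tables of the cumulative t-distribution with DOF $n - 1$ in Appendix G, we can 
determine the t-percentile $t_{(1+\delta)/2}$. Finally, from Equations 6.3-14 and 6.3-16, we obtain 

$$P\left[\hat{\mu}_X(n) - \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}} \le \mu_X \le \hat{\mu}_X(n) + \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}}\right] = \delta,$$

which gives as a $100\delta$ percent confidence interval 

$$\left[\hat{\mu}_X(n) - \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}},\; \hat{\mu}_X(n) + \frac{t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}}\right]. \tag{6.3-18}$$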


The width of the confidence interval is 

$$W_\delta = \frac{2\,t_{(1+\delta)/2}\,\hat{\sigma}_X(n)}{\sqrt{n}}. \tag{6.3-19}$$

Procedure for Getting a $\delta$-Confidence Interval Based on $n$ Observations 
on the Mean of a Normal Random Variable When $\sigma_X$ Is Not Known 


(1) Choose a value of $\delta$ and compute $(1 + \delta)/2$; 

(2) From the tables of the CDF for $T_{n-1}$, find the t-percentile number $t_{(1+\delta)/2}$ such that 
$F_T(t_{(1+\delta)/2}; n - 1) = (1 + \delta)/2$; 

(3) Obtain the realizations of $X_i$, $i = 1,\ldots,n$. Label these numbers $x_i$, $i = 1,\ldots,n$. 
Compute the realizations of $\hat{\mu}_X(n)$, $\hat{\sigma}_X(n)$; 

(4) Compute the numerical realization of the interval 
$\left[\hat{\mu}_X(n) - t_{(1+\delta)/2}\,\hat{\sigma}_X(n)/\sqrt{n},\; \hat{\mu}_X(n) + t_{(1+\delta)/2}\,\hat{\sigma}_X(n)/\sqrt{n}\right]$. 


Example 6.3-3 
(confidence interval on $\mu_X$ when $\sigma_X$ is unknown, Normal case) Twenty-one i.i.d. observations 
($n = 21$) are made on a Gaussian RV $X$. These observations are denoted as 
$X_1, X_2, \ldots, X_{21}$. Based on the data, the realizations of $\hat{\mu}_X(n)$ and $\hat{\sigma}_X(n)/\sqrt{n}$ are, respectively, 
3.5 and 0.45. A 90 percent confidence interval on $\mu_X$ is desired. 

Solution Since $P[-t_{0.95} \le T_{20} \le t_{0.95}] = 0.9$, we obtain from Equation 6.3-17 $F_T(t_{0.95}; 20) 
= 0.5(1 + 0.9) = 0.95$. Entering the Student-t tables at $F = 0.95$ and 20 degrees of freedom we obtain 
$t_{0.95} = 1.725$. The corresponding interval, from Equation 6.3-18, is $[3.5 - 1.725 \times 0.45,\; 3.5 + 
1.725 \times 0.45] = [2.72, 4.28]$. The width of the interval is $W_\delta = 2 \times 1.725 \times 0.45 = 1.55$. 
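The arithmetic of this example, and the procedure it follows, can be sketched in a few lines. Python's standard library has no t-distribution percentile function, so the table value is passed in as an argument; the function names are illustrative and not from the text.

```python
from math import sqrt
from statistics import mean, stdev


def t_interval_from_summary(mu_hat, se, t_percentile):
    """Interval of Equation 6.3-18 from the sample mean, sigma_hat/sqrt(n), and a table value."""
    half = t_percentile * se
    return mu_hat - half, mu_hat + half


def t_interval(xs, t_percentile):
    """Same interval computed from raw realizations x_1, ..., x_n."""
    n = len(xs)
    se = stdev(xs) / sqrt(n)        # stdev uses the 1/(n-1) form of Equation 6.3-3
    return t_interval_from_summary(mean(xs), se, t_percentile)


# Numbers of Example 6.3-3: mean 3.5, sigma_hat/sqrt(n) = 0.45, t_0.95 with 20 DOF = 1.725.
low, high = t_interval_from_summary(3.5, 0.45, 1.725)
print(f"[{low:.2f}, {high:.2f}], width = {high - low:.2f}")
```

Keeping the percentile as an explicit input mirrors step (2) of the procedure, where the value is read from a table for the appropriate number of degrees of freedom.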








Interpretation of the Confidence Interval 


The confidence interval generated from a series of realizations either will or will not include 
the true mean of X, which is a number unknown to us. Therefore, what does it mean to 
say that we have a “90 percent” confidence interval? The answer to this question goes to 
the heart of the meaning of probability, namely the frequency of a desirable outcome in 
repeated trials. Put succinctly, a “90 percent” confidence interval means that, say, in a 
thousand trials, one will observe that the interval covers the true mean about 900 times. 
Will we observe exactly 900 true-mean coverages? Not likely, but a success count of 900 is the 
most likely outcome. 
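The repeated-trials reading of a confidence interval can be illustrated by simulation. To stay within the standard library, the sketch below assumes $\sigma_X$ is known and uses the Normal-percentile interval rather than the t-based one; the true mean, sigma, trial count, and seed are arbitrary illustrative values.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(trials=2000, n=21, mu=3.5, sigma=2.0, delta=0.90, seed=7):
    """Fraction of trials whose delta-confidence interval (sigma known) covers mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1.0 + delta) / 2.0)
    half = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        hits += (mu_hat - half <= mu <= mu_hat + half)   # True counts as 1
    return hits / trials


print(f"observed coverage: {coverage():.3f}")
```

The observed fraction should land near 0.90 without being exactly equal to it, which is the point made in the paragraph above.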


6.4 ESTIMATION OF THE VARIANCE AND COVARIANCE 


We make $n$ observations $X_1, X_2, \ldots, X_n$ on a Normal RV $X$ with mean $\mu_X$ and variance 
$\sigma_X^2$. If $\mu_X$ is known then an unbiased VEF is computed from the random sample as 

$$\hat{\sigma}_X^2(n) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_X)^2 \tag{6.4-1}$$

and it is not difficult to show that $\hat{\sigma}_X^2(n)$ is an unbiased, consistent estimator of $\sigma_X^2$. If the 
mean is not known, then the VEF 

$$\hat{\sigma}_X^2(n) = \frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2 \tag{6.4-2}$$

is an unbiased and consistent estimator of $\sigma_X^2$. 


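Unbiasedness of $\hat{\sigma}_X^2(n)$ of Equation 6.4-2. Consider 

$$\begin{aligned}
E\left[\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2\right] &= E\left[\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\sum_{j=1}^{n} X_j^2 - \frac{2}{n}\sum_{k=1}^{n}\sum_{j>k} X_j X_k\right] \\
&= (n - 1)\sigma^2.
\end{aligned} \tag{6.4-3}$$

In obtaining Equation 6.4-3, we used the fact that $E[X_i^2] = \sigma^2 + \mu^2$, $i = 1,\ldots,n$, and that 
$E[X_j X_k] = \mu^2$ for $j \ne k$ by independence. Clearly if 

$$E\left[\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2\right] = (n - 1)\sigma^2,$$

then 

$$E\left[\frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2\right] = \sigma^2. \tag{6.4-4}$$

But the quantity inside the square brackets is $\hat{\sigma}_X^2(n)$ of Equation 6.4-2. Hence $\hat{\sigma}_X^2(n)$ is 
unbiased for $\sigma^2$. 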
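The effect of the divisor can be seen in a short simulation that averages both forms of the variance estimator over many samples. This is a sketch under the assumption of Normal observations; the sample size, sigma, trial count, and seed are arbitrary illustrative values.

```python
import random


def average_vef(n=5, sigma=2.0, trials=20000, seed=3):
    """Average the 1/(n-1) and 1/n variance estimators over many simulated samples."""
    rng = random.Random(seed)
    total_unbiased = total_biased = 0.0
    for _ in range(trials):
        xs = [rng.gauss(10.0, sigma) for _ in range(n)]
        mu_hat = sum(xs) / n
        ss = sum((x - mu_hat) ** 2 for x in xs)   # sum of squared deviations
        total_unbiased += ss / (n - 1)            # form of Equation 6.4-2
        total_biased += ss / n                    # 1/n alternative
    return total_unbiased / trials, total_biased / trials


unbiased, biased = average_vef()
print(f"true variance        : {2.0 ** 2:.3f}")
print(f"mean of 1/(n-1) form : {unbiased:.3f}")
print(f"mean of 1/n form     : {biased:.3f}")   # close to (n-1)/n * sigma^2
```

With $n = 5$ the $1/n$ form averages about 80 percent of the true variance, which is the factor $(n-1)/n$ implied by Equation 6.4-3.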





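Consistency of $\hat{\sigma}_X^2(n)$ of Equation 6.4-2. The variance of $\hat{\sigma}_X^2(n)$ is given by 

$$\begin{aligned}
\mathrm{Var}[\hat{\sigma}_X^2(n)] &= E\big[(\hat{\sigma}_X^2(n) - \sigma^2)^2\big] \\
&= E\left[\frac{1}{(n-1)^2}\left(\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^4 + \sum_{i \ne j}\big(X_i - \hat{\mu}_X(n)\big)^2\big(X_j - \hat{\mu}_X(n)\big)^2\right)\right] \\
&\quad + \sigma^4 - \frac{2\sigma^2}{n-1}\sum_{i=1}^{n} E\big[(X_i - \hat{\mu}_X(n))^2\big].
\end{aligned}$$

A straightforward calculation shows that for $n \gg 1$ 

$$\mathrm{Var}[\hat{\sigma}_X^2(n)] \approx \frac{1}{n}c_4, \tag{6.4-5}$$

where $c_4 \triangleq E[(X_1 - \mu)^4]$ (see Equation 4.3-2a). Assuming that $c_4$ (the fourth-order central 
moment) exists, we once again use the Chebyshev inequality to write that 

$$P[|\hat{\sigma}_X^2(n) - \sigma^2| > \varepsilon] \le \frac{\mathrm{Var}[\hat{\sigma}_X^2(n)]}{\varepsilon^2} \approx \frac{c_4}{n\varepsilon^2} \xrightarrow[n\to\infty]{} 0. \tag{6.4-6}$$

Hence $\hat{\sigma}_X^2(n)$ is a consistent estimator for $\sigma^2$. 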


Example 6.4-1 
(computing the numerical sample mean and numerical sample variance of a Normal random 
variable) Ten observations are made on a Normal RV $X: N(3, 1/10)$. The realizations are: 
3.12, 2.87, 3.04, 2.77, 2.89, 3.34, 3.51, 2.44, 3.28, and 2.95. To compute the numerical sample 
mean and the numerical sample variance, we proceed as follows: 

The numerical sample mean is computed as 

$$\mu_s = \frac{1}{10}(3.12 + 2.87 + 3.04 + 2.77 + 2.89 + 3.34 + 3.51 + 2.44 + 3.28 + 2.95) = 3.02.$$

The numerical sample variance is computed as 

$$\begin{aligned}
\sigma_s^2 &= \frac{1}{9}(0.01 + 0.0225 + 0.0004 + 0.0625 + 0.0169 + 0.1024 + 0.2401 \\
&\qquad + 0.3364 + 0.0676 + 0.0049) \\
&= 0.096.
\end{aligned}$$

In signal processing the ratio $(\mu_s/\sigma_s)^2$ is sometimes called the signal-to-noise (power) ratio; 
in this case it is 95. It is commonly given in decibels (dB), which in this case is $10 \times \log_{10} 95 = 
19.8$ dB. 
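The numbers in this example can be reproduced with the standard library, where `statistics.variance` uses the $1/(n-1)$ form. The helper name `snr_db` is an illustrative choice.

```python
from math import log10
from statistics import mean, variance

# Realizations listed in Example 6.4-1.
xs = [3.12, 2.87, 3.04, 2.77, 2.89, 3.34, 3.51, 2.44, 3.28, 2.95]


def snr_db(xs):
    """Signal-to-noise (power) ratio (mu_s / sigma_s)^2 expressed in decibels."""
    return 10.0 * log10(mean(xs) ** 2 / variance(xs))


mu_s = mean(xs)            # numerical sample mean
var_s = variance(xs)       # numerical sample variance, 1/(n-1) form
print(f"mu_s = {mu_s:.3f}, sigma_s^2 = {var_s:.4f}, SNR = {snr_db(xs):.1f} dB")
```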








Confidence Interval for the Variance of a Normal Random Variable 


Determining a confidence interval for the variance involves the $\chi^2$ distribution. Suppose 
we make $n$ i.i.d. observations on the Normal RV $X$ and label these observations as $X_1, 
X_2, \ldots, X_n$. Then, for each $i$, 

$$U_i \triangleq \frac{X_i - \mu_X}{\sigma_X} \tag{6.4-7}$$

is $N(0,1)$ and $Z_n \triangleq \sum_{i=1}^{n} U_i^2$ is Chi-square distributed with a DOF of $n$. The $\chi^2$ pdf is shown 
in Figure 6.4-1 and is denoted by $f_{\chi^2}(x; n)$. If $\mu_X$ is not known in Equation 6.4-7 and we 
replace it with $\hat{\mu}_X(n)$ from Equation 6.3-2 we create a new RV 

$$V_i \triangleq \frac{X_i - \hat{\mu}_X(n)}{\sigma_X} \tag{6.4-8}$$

and the sum $Z_{n-1} \triangleq \sum_{i=1}^{n} V_i^2$ is also Chi-square but with $n - 1$ degrees of freedom. 

[Figure 6.4-1: Chi-square pdf for n = 2 and n = 10; vertical axis: pdf value.] 

Figure 6.4-1 The Chi-square pdf for n = 2 (curve with value 0.5 at the origin) and n = 10. For all 
values of n > 2, the pdf will be zero at the origin. 


Example 6.4-2 
(computing the degrees of freedom of a Chi-square RV) With the $V_i$ defined in Equation 
6.4-8, the random variable $\sum_{i=1}^{2} V_i^2$ is Chi-square with a DOF of unity. We can see this with 
the help of a little bit of algebra. We find that $V_1^2 + V_2^2 = \left[(X_1 - X_2)/(\sigma_X\sqrt{2})\right]^2$. But 
$U \triangleq (X_1 - X_2)/(\sigma_X\sqrt{2})$ is $N(0,1)$ and hence in the sum $Z_n \triangleq \sum_{i=1}^{n} U_i^2$ there is only one 
nonzero term, that is, $U^2 = Z_1$. 
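To find a confidence interval on $\sigma_X^2$ at level, say, $\delta$ (e.g., $\delta = 0.95$, $\delta = 0.98$, $\delta = 0.99$), 
we begin with 

$$W_{n-1} \triangleq \sum_{i=1}^{n} V_i^2 = \sum_{i=1}^{n}\left(\frac{X_i - \hat{\mu}_X(n)}{\sigma_X}\right)^2$$

and seek numbers $a, b$ such that $P[a \le W_{n-1} \le b] = P\left[a \le \frac{1}{\sigma_X^2}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2 \le b\right] = \delta$. 
For $a > 0$, $b > 0$, and $b > a$ the event $\left\{\zeta: a \le \frac{1}{\sigma_X^2}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2 \le b\right\}$ is identical with 
the event $\left\{\zeta: \frac{1}{b}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2 \le \sigma_X^2 \le \frac{1}{a}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2\right\}$. Hence the width of the 
confidence interval for the variance is† 

$$W_\delta(a, b) = \left(\frac{1}{a} - \frac{1}{b}\right)\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2. \tag{6.4-9}$$

Since $W_{n-1} = \frac{1}{\sigma_X^2}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2$ is $\chi^2_{n-1}$ we solve for the numbers $a, b$ from 
$P[a \le W_{n-1} \le b] = F_{\chi^2}(b; n - 1) - F_{\chi^2}(a; n - 1) = \delta$. To avoid the algebraic difficulties associated with 
finding the shortest interval, we find numbers $a, b$ that give a near-shortest interval as follows: 
The probability that $W_{n-1}$ lies outside the interval is $1 - \delta = 1 - F_{\chi^2}(b; n - 1) + F_{\chi^2}(a; n - 1)$; 
if we denote $1 - \delta$ as the “error probability” and assign $1 - F_{\chi^2}(b; n - 1) = (1 - \delta)/2$ and 
$F_{\chi^2}(a; n - 1) = (1 - \delta)/2$, then we have divided the overall “error probability” into equal 
area-halves under the tails of the $\chi^2_{n-1}$ pdf. It then follows that $a = x_{(1-\delta)/2}$, that is, 
$F_{\chi^2}(x_{(1-\delta)/2}; n - 1) = (1 - \delta)/2$, and $b = x_{(1+\delta)/2}$, that is, $F_{\chi^2}(x_{(1+\delta)/2}; n - 1) = (1 + \delta)/2$. 
The numbers $x_{(1-\delta)/2}$ and $x_{(1+\delta)/2}$ are called, respectively, the $(1 - \delta)/2$ and $(1 + \delta)/2$ 
percentiles of the $\chi^2_{n-1}$ RV. The $\delta$-confidence interval for the variance is 

$$\left[\frac{1}{x_{(1+\delta)/2}}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2,\; \frac{1}{x_{(1-\delta)/2}}\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2\right]$$

and its length $L$ is 

$$L = \left(\frac{1}{x_{(1-\delta)/2}} - \frac{1}{x_{(1+\delta)/2}}\right)\sum_{i=1}^{n}\big(X_i - \hat{\mu}_X(n)\big)^2.$$

†Please do not confuse the width symbol $W_\delta(a, b)$ with the $\chi^2$ random variable symbol $W_{n-1}$. 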
Example 6.4-3 
2 


Sixteen i.i.d. observations are made on $X:N(\mu_X,\sigma_X^2)$. A confidence interval on $\sigma_X^2$ is
required. Find the numbers $a$, $b$ that will give a near-shortest 95 percent confidence interval
on $\sigma_X^2$ using the “equal error probability” rule.

Solution  $F_{\chi^2}(a;15) = F_{\chi^2}(x_{0.025};15) = 0.025$ and $F_{\chi^2}(b;15) = F_{\chi^2}(x_{0.975};15) = 0.975$. From the table of
the Chi-square distribution, we find $a = x_{0.025} = 6.26$ and $b = x_{0.975} = 27.5$.
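The table lookup in Example 6.4-3 can be checked numerically. The following is a minimal sketch, assuming only the Python standard library; the helper names `chi2_cdf`, `chi2_quantile`, and `equal_tail_numbers` are illustrative and do not come from the text. The Chi-square CDF is evaluated with a series for the regularized incomplete gamma function and inverted by bisection, which is a substitute for the printed table, not the method used in the text.

```python
import math

def chi2_cdf(x, dof):
    """CDF of a Chi-square RV with `dof` degrees of freedom.

    Uses the series P(s, h) = h^s e^{-h} / Gamma(s) * sum_k h^k / (s (s+1) ... (s+k))
    with s = dof / 2 and h = x / 2.
    """
    if x <= 0.0:
        return 0.0
    s, h = dof / 2.0, x / 2.0
    term = 1.0 / s
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= h / (s + k)
        total += term
        k += 1
    return total * math.exp(s * math.log(h) - h - math.lgamma(s))

def chi2_quantile(u, dof):
    """Number x_u with chi2_cdf(x_u, dof) = u, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, dof) < u:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dof) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equal_tail_numbers(delta, n):
    """Numbers a, b of the "equal error probability" rule for n observations."""
    dof = n - 1
    return chi2_quantile((1 - delta) / 2, dof), chi2_quantile((1 + delta) / 2, dof)

a, b = equal_tail_numbers(0.95, 16)
print(a, b)  # compare with the table values 6.26 and 27.5 quoted in the example
```

Given a realization of $\sum_i (X_i - \hat{\mu}_X(n))^2$, dividing it by `b` and by `a` gives the lower and upper end points of the interval for the variance.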


Estimating the Standard Deviation Directly 


We can estimate the standard deviation $\sigma_X$ from

$$\hat{\sigma}_X(n) = \left(\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2\right)^{1/2}$$ (6.4-10)

but this involves computing $\hat{\sigma}_X^2(n)$ first. Another approach estimates $\sigma_X$ directly. Consider
two i.i.d. observations $X_1$, $X_2$ on the generic RV $X$. Let $Z \triangleq \max(X_1, X_2)$, $\hat{\mu}_X \triangleq (X_1 + X_2)/2$.
The pdf of $Z$ is readily computed as $f_Z(z) = 2F_X(z)f_X(z)$, where $F_X(z)$ and $f_X(z)$ are,
respectively, the CDF and pdf of $X$. Now consider the estimator

$$\hat{\sigma}_X \triangleq \sqrt{\pi}\,(Z - \hat{\mu}_X)$$ (6.4-11)

and compute $E[\hat{\sigma}_X] = \sqrt{\pi}\,E[(Z - \hat{\mu}_X)] = \sqrt{\pi}\,(E[Z] - \mu_X)$. The computation of $E[Z]$ when
$X$ is Normal can be done with the aid of standard tables of integrals (see Handbook of
Mathematical Functions, M. Abramowitz and I. A. Stegun, eds., Dover, New York, 1970,
p. 303, formula 7.4.14), or with Maple, MathCAD, Mathematica, etc. We find that $E[Z] =
\mu_X + \sigma_X/\sqrt{\pi}$ so that $E[\hat{\sigma}_X] = \sigma_X$. Hence $\hat{\sigma}_X \triangleq \sqrt{\pi}\,(Z - \hat{\mu}_X)$ is an unbiased estimator
for $\sigma_X$.

Example 6.4-4
(one-shot estimation of $\sigma_X$) Two realizations of $X : N(\mu_X,\sigma_X^2)$ are obtained as 3.8, 4.1.
Then (primes indicate realizations) $Z' = \max(3.8, 4.1) = 4.1$, $\hat{\mu}' = (3.8 + 4.1)/2 = 3.95$,
and $\hat{\sigma}_X' = 0.26$. Computing $\hat{\sigma}$ from Equation 6.3-6 yields 0.21.
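The arithmetic of Example 6.4-4 is short enough to check directly. This is a small sketch with illustrative function names (`one_shot_sigma` and `sample_sigma` are not from the text); the second function assumes that the square root of the variance estimator uses the $1/(n-1)$ normalization, which is what reproduces the quoted value 0.21.

```python
import math

def one_shot_sigma(x1, x2):
    """Realization of sqrt(pi) * (Z - mu_hat) from two observations (Equation 6.4-11)."""
    z = max(x1, x2)
    mu_hat = 0.5 * (x1 + x2)
    return math.sqrt(math.pi) * (z - mu_hat)

def sample_sigma(xs):
    """Square root of the variance estimate with the 1/(n-1) normalization (assumed)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    return math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / (n - 1))

print(one_shot_sigma(3.8, 4.1))  # close to the 0.26 quoted in the example
print(sample_sigma([3.8, 4.1]))  # close to the 0.21 quoted in the example
```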





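To compute the variance of the standard deviation estimator function (SDEF) in
Equation 6.4-11 we write:

$$\mathrm{Var}(\hat{\sigma}_X) = \pi\left(E[Z^2] + E[\hat{\mu}_X^2] - 2E[Z\hat{\mu}_X]\right) - \sigma_X^2.$$

This computation takes some work but the result is

$$\mathrm{Var}(\hat{\sigma}_X) = \left(\frac{\pi}{2} - 1\right)\sigma_X^2 \approx 0.57\,\sigma_X^2.$$ (6.4-12)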


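In practice we would not want to estimate $\sigma_X$ from only two observations on $X$. Suppose
we make $n$ (even) observations on $X$, which we denote as $X_1, X_2, \ldots, X_n$ and pair them as
$\{X_1, X_2\}, \ldots, \{X_{n-1}, X_n\}$. Let

$$\hat{\sigma}_1 \triangleq \sqrt{\pi}\left(\max(X_1, X_2) - 0.5(X_1 + X_2)\right)$$

$$\hat{\sigma}_2 \triangleq \sqrt{\pi}\left(\max(X_3, X_4) - 0.5(X_3 + X_4)\right)$$

$$\vdots$$

$$\hat{\sigma}_{n/2} \triangleq \sqrt{\pi}\left(\max(X_{n-1}, X_n) - 0.5(X_{n-1} + X_n)\right)$$

and define

$$\hat{\sigma}_{ave} \triangleq \frac{1}{n/2}\sum_{i=1}^{n/2}\hat{\sigma}_i.$$ (6.4-13)

Then $\mathrm{Var}(\hat{\sigma}_{ave}) = \frac{2}{n}\mathrm{Var}(\hat{\sigma}_X)$, which, with Equation 6.4-12, gives

$$\mathrm{Var}(\hat{\sigma}_{ave}) \approx \frac{1.14}{n}\,\sigma_X^2.$$ (6.4-14)

It is straightforward to show that $\hat{\sigma}_{ave}$ is a consistent estimator for $\sigma_X$; we leave this
as an exercise for the reader. A confidence interval for $\sigma_X$ based on estimating $\sigma_X$ with
$\hat{\sigma}_X \triangleq \sqrt{\pi}\,(Z - \hat{\mu})$ is discussed in [6-3] and [6-4].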
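A short simulation can make the behavior of the paired estimator concrete. The sketch below is only an illustration: the function names, the choice $\sigma_X = 2$, $n = 100$, the number of trials, and the random seed are all assumptions made here, not values from the text. It averages the pairwise estimators as in Equation 6.4-13 and compares the empirical mean and variance with $\sigma_X$ and with $(2/n)(\pi/2 - 1)\sigma_X^2$.

```python
import math
import random

def sigma_ave(xs):
    """Average of the pairwise one-shot estimators (Equation 6.4-13)."""
    n = len(xs)
    if n < 2 or n % 2:
        raise ValueError("need an even number of observations")
    root_pi = math.sqrt(math.pi)
    pair_estimates = [
        root_pi * (max(xs[i], xs[i + 1]) - 0.5 * (xs[i] + xs[i + 1]))
        for i in range(0, n, 2)
    ]
    return sum(pair_estimates) / len(pair_estimates)

def simulate(sigma=2.0, n=100, trials=2000, seed=1):
    """Empirical mean and variance of sigma_ave over repeated Normal samples."""
    rng = random.Random(seed)
    estimates = [
        sigma_ave([rng.gauss(0.0, sigma) for _ in range(n)]) for _ in range(trials)
    ]
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / (trials - 1)
    return mean, var

mean_est, var_est = simulate()
predicted_var = (2.0 / 100) * (math.pi / 2.0 - 1.0) * 2.0 ** 2
print(mean_est, var_est, predicted_var)
```

The empirical mean should sit near the assumed $\sigma_X = 2$, and the empirical variance near the predicted value, up to simulation noise.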


Estimating the Covariance

The covariance, defined by

$$c_{11} = \mathrm{Cov}[XY] = E[(X - \mu_X)(Y - \mu_Y)],$$ (6.4-15)

is classically estimated from the covariance estimating function (CEF)

$$\hat{c}_{11} \triangleq \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)\left(Y_i - \hat{\mu}_Y(n)\right),$$ (6.4-16)

where $\{X_i, Y_i, i = 1,\ldots,n\}$ are $n$ paired i.i.d. observations. We leave it to the reader to
show that $\hat{c}_{11}$ is an unbiased and consistent estimator for $c_{11}$. The normalized covariance,
also called the correlation coefficient, is defined as

$$\rho_{XY} \triangleq \frac{c_{11}}{\sqrt{\sigma_X^2\sigma_Y^2}}.$$ (6.4-17)

It is estimated from

$$\hat{\rho}_{XY} \triangleq \frac{\hat{c}_{11}}{\sqrt{\hat{\sigma}_X^2\hat{\sigma}_Y^2}} = \frac{\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)\left(Y_i - \hat{\mu}_Y(n)\right)}{\left(\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2\sum_{i=1}^{n}\left(Y_i - \hat{\mu}_Y(n)\right)^2\right)^{1/2}}.$$ (6.4-18)

The distribution of $\hat{\rho}_{XY}$ is not available in closed form. However, a confidence interval for
$\rho_{XY}$ can be found using more advanced methods [6-1].
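As an illustration of the covariance and correlation estimates, here is a minimal sketch. The function names and the four paired data values are hypothetical and chosen only so the result is easy to verify by hand; the covariance uses the $1/(n-1)$ normalization, which is an assumption consistent with the estimator being unbiased.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov_estimate(xs, ys):
    """Realization of the covariance estimate with 1/(n-1) normalization (assumed)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def corr_estimate(xs, ys):
    """Realization of the correlation coefficient estimate; the normalization cancels."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical paired observations
ys = [2.0, 4.0, 6.0, 8.0]   # exactly 2 * xs, so the correlation estimate is 1
print(cov_estimate(xs, ys), corr_estimate(xs, ys))
```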


6.5 SIMULTANEOUS ESTIMATION OF MEAN AND VARIANCE 


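If we seek, say, a 95 percent confidence region on both $\mu_X$ and $\sigma_X^2$, we take advantage of
the RVs $\hat{\mu}_X(n)$ and $\hat{\sigma}_X^2(n)$ being independent. Thus, we may write

$$P\left[-a < \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} < a,\ b < \frac{1}{\sigma_X^2}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2 < c\right] = 0.95$$ (6.5-1)

or, equivalently,

$$P\left[-a < \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} < a\right] \times P\left[b < \frac{1}{\sigma_X^2}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2 < c\right] = 0.95.$$ (6.5-2)

Equation 6.5-2 follows from Equation 6.5-1 because of the independence of the events

$$E_1 \triangleq \left\{-a < \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} < a\right\} \quad \text{and} \quad E_2 \triangleq \left\{b < \frac{1}{\sigma_X^2}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2 < c\right\}.$$

We note that

$$Z \triangleq \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}}$$

is the standard Normal RV $N(0,1)$ with distribution function $F_{SN}(z)$ while

$$W_{n-1} \triangleq \frac{1}{\sigma_X^2}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2$$

is $\chi^2_{n-1}$ with distribution function $F_{\chi^2}(z; n-1)$.

The next step is to associate a probability to each of the events $E_1$, $E_2$. As an example
we could factor the joint $\delta$-confidence as $\delta = \sqrt{\delta} \times \sqrt{\delta}$; this would give for $\delta = 0.95$

$$P\left[-a < \frac{\hat{\mu}_X(n) - \mu_X}{\sigma_X/\sqrt{n}} < a\right] = \sqrt{0.95} \approx 0.975$$ (6.5-3)

and

$$P\left[b < \frac{1}{\sigma_X^2}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2 < c\right] = \sqrt{0.95} \approx 0.975.$$ (6.5-4)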





From Equation 6.5-3 we recognize that $a = z_{0.9875}$, that is, $F_{SN}(z_{0.9875}) = 0.9875$,
the 98.75 percentile of the standard Normal RV. From Equation 6.5-4, we determine, using
the “equal-error” assignment rule to the tails of the Chi-square pdf, that $b = x_{0.0125}$ and
$c = x_{0.9875}$, that is, the 1.25 and 98.75 percentiles of the cumulative Chi-square distribution
$F_{\chi^2}(z; n-1)$. More generally, for any given $\delta$-confidence interval and any given $n$, we can
find numbers $a$, $b$, and $c$ to satisfy the confidence constraints. Once this is done we can find
in the $\mu, \sigma^2$ parameter space the boundaries of the $\delta$-confidence region for $\mu_X, \sigma_X^2$. Event
$E_1$ is the convex region inside the parabola described by $\sigma^2 = n(\mu - \hat{\mu}_X)^2/a^2$. Event $E_2$
is the region between the end points

$$\sigma^2_{\max} = \frac{1}{b}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2 \ \text{(upper bound)}, \qquad \sigma^2_{\min} = \frac{1}{c}\sum_{i=1}^{n}\left(X_i - \hat{\mu}_X(n)\right)^2 \ \text{(lower bound)}.$$ (6.5-5)

The event $E_1 \cap E_2$ is then the shaded region shown in Figure 6.5-1.
In approximately 950 of 1000 cases, the region shown in Figure 6.5-1 will cover the
point $(\mu_X, \sigma_X^2)$, that is, the true values of the unknown mean and variance.
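The factoring $\delta = \sqrt{\delta} \times \sqrt{\delta}$ and the resulting tail probabilities can be computed rather than read from tables. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the function name `split_confidence` is illustrative. Note that the text rounds $\sqrt{0.95}$ to 0.975 before forming the percentiles, so the unrounded values below differ slightly from 0.0125 and 0.9875.

```python
import math
from statistics import NormalDist

def split_confidence(delta):
    """Per-event probability, Chi-square tail levels, and the number a for the mean."""
    per_event = math.sqrt(delta)          # probability assigned to each of E1, E2
    lower_level = (1.0 - per_event) / 2   # percentile level defining b
    upper_level = (1.0 + per_event) / 2   # percentile level defining c and a
    a = NormalDist().inv_cdf(upper_level)
    return per_event, lower_level, upper_level, a

per_event, lower_level, upper_level, a = split_confidence(0.95)
print(per_event, lower_level, upper_level, a)
```

The numbers `b` and `c` would then be the Chi-square percentiles at `lower_level` and `upper_level` with $n - 1$ degrees of freedom.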


Figure 6.5-1 The confidence region for the combined estimation of $\mu$ and $\sigma^2$ (horizontal axis: $\mu$; vertical axis: $\sigma^2$; the curved boundary is the parabola $\sigma^2 = n(\mu - \hat{\mu}_X)^2/a^2$).

Example 6.5-1
(confidence region for mean and variance) We make 21 observations $\{X_i, i = 1,\ldots,21\}$ on
a Normal population $X : N(\mu_X, \sigma_X^2)$. A 90 percent confidence region is desired for the pair
$\mu_X, \sigma_X^2$.

To achieve a 90 percent confidence region, we assign (approximately) a 0.95 probability
that the $N(0,1)$ RV $Z$ lies in the interval $(-a,a)$ and a 0.95 probability that the Chi-square
RV $W$ with DOF of 20 lies in the interval $(b,c)$. From Equation 6.5-3 we obtain
$P[-z_{0.975} < Z < z_{0.975}] = 0.95$; hence, from the standard Normal distribution table, we
find $F_{SN}(z_{0.975}) = 0.975$ or $z_{0.975} = 1.96$. From Equation 6.5-4 we obtain $P[b < W < c] = 0.95$,
from which we determine numbers $b = x_{0.025}$, $c = x_{0.975}$ using the “equal-error”
assignment of Example 6.3-3. Thus, $F_{\chi^2}(x_{0.025}; 20) = 0.025$ and $F_{\chi^2}(x_{0.975}; 20) = 0.975$ so
that $x_{0.025} = 9.59$ and $x_{0.975} = 34.2$. The numbers $x_{0.025}$ and $x_{0.975}$ are the 2.5 and 97.5
percentiles, respectively, of the $\chi^2$ RV.
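Once $a$, $b$, and $c$ are known, deciding whether a candidate pair $(\mu, \sigma^2)$ lies in the confidence region is a two-part check: inside the parabola (event $E_1$) and between the two variance bounds (event $E_2$). The sketch below uses the numbers $a = 1.96$, $b = 9.59$, $c = 34.2$ from Example 6.5-1; the function name and the 21 data values are hypothetical and serve only to exercise the test.

```python
def in_confidence_region(mu, var, xs, a, b, c):
    """True if (mu, var) satisfies both E1 (parabola) and E2 (variance bounds)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sum_sq = sum((x - mu_hat) ** 2 for x in xs)
    inside_parabola = var > n * (mu - mu_hat) ** 2 / a ** 2
    between_bounds = sum_sq / c < var < sum_sq / b
    return inside_parabola and between_bounds

# Hypothetical realizations: 21 evenly spaced values with sample mean 1.0.
xs = [0.1 * i for i in range(21)]
a, b, c = 1.96, 9.59, 34.2   # numbers taken from Example 6.5-1

print(in_confidence_region(1.0, 0.4, xs, a, b, c))   # candidate near the sample mean
print(in_confidence_region(2.0, 0.4, xs, a, b, c))   # candidate far from the sample mean
```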





6.6 ESTIMATION OF NON-GAUSSIAN PARAMETERS FROM LARGE SAMPLES 


Consider an RV $X$ with mean $\mu$ and finite variance $\sigma^2$. We make $n$ i.i.d. observations
$\{X_i, i = 1,\ldots,n\}$ on $X$ and deduce from the Central Limit Theorem that the sample mean
estimator† (SME)

$$\hat{\mu}(n) = \frac{1}{n}\sum_{i=1}^{n} X_i$$

is approximately Normal as $N(\mu, \sigma^2/n)$ for large $n$. If $X$ is a continuous RV then the SME
is approximately Normal in density, else it is approximately Normal in distribution. When
the parameters to be estimated are associated with non-Gaussian distributions, it may still
be possible to estimate them using Equation 6.6-1 as a starting point:

$$P\left[-a < \frac{\hat{\mu} - \mu}{\sigma/\sqrt{n}} < a\right] = \delta,$$ (6.6-1)

which can be rewritten as

$$P\left[(-a\sigma/\sqrt{n}) + \hat{\mu} < \mu < (a\sigma/\sqrt{n}) + \hat{\mu}\right] = \delta.$$ (6.6-2)

The reader will recognize that this is the expression for a $100 \times \delta$ percent confidence
interval for $\mu$. When distributions are non-Gaussian, the mean and variance may be related
parameters, that is, $\sigma = \sigma(\mu)$. How do we handle such cases? We illustrate with two examples
from [6-2].


Example 6.6-1
(confidence interval for $\lambda$ in the exponential distribution) Suppose we want to estimate $\lambda$
in the exponential pdf $f_X(x) = \lambda e^{-\lambda x}u(x)$. For this law we find

$$\mu \triangleq E[X] = \int_0^{\infty} x\lambda e^{-\lambda x}\,dx = \lambda^{-1}$$

and

$$\sigma^2 \triangleq E[(X - \mu)^2] = \int_0^{\infty}\left(x - \lambda^{-1}\right)^2 \lambda e^{-\lambda x}\,dx = \lambda^{-2}.$$

Inserting these results into Equation 6.6-2 and rearranging terms to expose $\lambda$ yields

$$P\left[\frac{1 - a/\sqrt{n}}{\hat{\mu}} < \lambda < \frac{1 + a/\sqrt{n}}{\hat{\mu}}\right] = \delta.$$ (6.6-3)

The number $a$ is obtained from approximating $Z \triangleq (\hat{\mu} - \mu)\sqrt{n}/\sigma$ as a $N(0,1)$ random
variable. This yields $a = z_{(1+\delta)/2}$ where $F_{SN}(z_{(1+\delta)/2}) = (1 + \delta)/2$. Thus a $100 \times \delta$ percent
confidence interval for $\lambda$ has width

$$W_\delta = \frac{2z_{(1+\delta)/2}}{\hat{\mu}\sqrt{n}}.$$ (6.6-4)

†Recall we use the mean-estimator function and the sample mean estimator interchangeably.


Example 6.6-2
(numerical evaluation of confidence interval for $\lambda$) It is desired to obtain a 95 percent
confidence interval on the parameter $\lambda$ of the exponential distribution from 64 i.i.d. observations
on an exponential RV $X$. The estimate is $\hat{\mu}' = 3.5$. From Equation 6.6-1 we obtain
$2 \times \mathrm{erf}(a) = 0.95$ or, equivalently, $F_{SN}(z_{(1+\delta)/2}) = (1 + \delta)/2 = 0.975$. This gives $z_{0.975} = 1.96$.
Then from Equations 6.6-3 and 6.6-4 we compute that the 95 percent confidence interval
for $\lambda$ is {0.22, 0.36} and has an approximate width of 0.14.
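The numbers in Example 6.6-2 follow directly from Equations 6.6-3 and 6.6-4. The sketch below is one way to reproduce them, assuming the standard library's `statistics.NormalDist` for the Normal percentile; the function name `exponential_rate_interval` is illustrative.

```python
import math
from statistics import NormalDist

def exponential_rate_interval(mu_hat, n, delta):
    """Large-sample confidence interval for lambda from Equation 6.6-3."""
    a = NormalDist().inv_cdf((1.0 + delta) / 2.0)
    ratio = a / math.sqrt(n)
    if ratio >= 1.0:
        raise ValueError("n is too small for the lower end point to be positive")
    return (1.0 - ratio) / mu_hat, (1.0 + ratio) / mu_hat

low, high = exponential_rate_interval(mu_hat=3.5, n=64, delta=0.95)
print(low, high, high - low)  # compare with {0.22, 0.36} and the width 0.14
```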


Example 6.6-3 
(confidence interval for $p$ in the Bernoulli distribution) Given a Bernoulli RV $X$, with
probability $P[X = 1] = p$, and $P[X = 0] = q = 1 - p$, we want to estimate $p$ at a
$100 \times \delta$ percent level of confidence from $n$ (sufficiently large) i.i.d. observations on $X$. For
this distribution $\mu_X \triangleq E[X] = p$ and the MEF is $\hat{p} = (1/n)\sum_{i=1}^{n} X_i$. As demonstrated in
earlier chapters $E[\hat{p}] = p$ and $\mathrm{Var}[\hat{p}] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[X_i] = \frac{npq}{n^2} = pq/n$.

Hence the RV

$$Z \triangleq \frac{\hat{p} - p}{\sqrt{pq/n}}$$ (6.6-5)

for large $n$ is Normal in distribution (not in density since $X$ is a discrete RV) as $N(0,1)$.
To obtain a $100 \times \delta$ percent confidence interval on $p$ we write

$$P\left[-a < \frac{\hat{p} - p}{\sqrt{pq/n}} < a\right] = \delta$$ (6.6-6)


or, equivalently,

$$P\left[(\hat{p} - p)^2 < a^2pq/n\right] = \delta.$$

As usual we find the constant $a$ from $2\,\mathrm{erf}(a) = \delta$, that is†, $a = z_{(1+\delta)/2}$, and find the end points
of the confidence interval by solving for the roots of $(\hat{p} - p)^2 - a^2pq/n = 0$ (where $q = 1 - p$).
With $\hat{q} \triangleq 1 - \hat{p}$, these are

$$p_1 = \frac{\hat{p} + (a^2/2n)}{1 + (a^2/n)} - \frac{\sqrt{(a^2/n)\left[(a^2/n) + 4\hat{p}\hat{q}\right]}}{2\left(1 + (a^2/n)\right)}, \qquad p_2 = \frac{\hat{p} + (a^2/2n)}{1 + (a^2/n)} + \frac{\sqrt{(a^2/n)\left[(a^2/n) + 4\hat{p}\hat{q}\right]}}{2\left(1 + (a^2/n)\right)},$$

giving an interval width

$$W_\delta = |p_2 - p_1| = \frac{1}{1 + (a^2/n)}\sqrt{(a^2/n)\left[(a^2/n) + 4\hat{p}\hat{q}\right]}.$$ (6.6-7)

The width of the interval decreases slowly with sample size (Figure 6.6-1).

Figure 6.6-1 The width of the 95 percent confidence interval on the Bernoulli probability $p$ (vertical axis: interval width; horizontal axis: number of samples) decreases slowly with an increase in the number of
samples in Example 6.6-3. Here we assumed that $4pq \approx 1$.

†Recall that $2 \times \mathrm{erf}(a) = 2 \times F_{SN}(a) - 1 = \delta$ so that $a = z_{(1+\delta)/2}$, i.e., the $(1 + \delta)/2$ percentile of the
standard Normal RV.





Example 6.6-4 
(how fair is the “fair” coin) We wish to obtain information about the “fairness” of a coin.
For this purpose the coin is tossed 100 times and 47 heads are observed. A 95 percent confidence
interval on $p$, the probability of a head, is desired. Using the MEF we find that $\hat{p}' = 0.47$. We
find $a$ from $2 \times \mathrm{erf}(a) = 0.95$ or $a = 1.96$ and from Equation 6.6-7, $W_\delta \approx 0.192$. The interval
is centered at 0.47 and extends from 0.37 to 0.57. The interval includes the “fair” coin value
of $p = 0.5$ and we have no basis for believing that the coin is biased. If the number of i.i.d.
observations increases to 1200, and we observe 564 heads, then $\hat{p}'$ still has value $\hat{p}' = 0.47$
but the 95 percent interval is {0.442, 0.498} and does not include the “fair” value of 0.5.
This suggests that the coin has a slight bias in the direction of getting more tails.
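The end points in Example 6.6-4 are the two roots of the quadratic $(\hat{p} - p)^2 - a^2p(1-p)/n = 0$, so they are easy to evaluate numerically. The following sketch is one possible implementation; the function name `bernoulli_interval` is illustrative and the Normal percentile comes from `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def bernoulli_interval(p_hat, n, delta):
    """Roots p1 < p2 of (p_hat - p)^2 - a^2 p (1 - p) / n = 0 for a = z_{(1+delta)/2}."""
    a = NormalDist().inv_cdf((1.0 + delta) / 2.0)
    t = a * a / n
    center = (2.0 * p_hat + t) / (2.0 * (1.0 + t))
    half_width = math.sqrt(t * (t + 4.0 * p_hat * (1.0 - p_hat))) / (2.0 * (1.0 + t))
    return center - half_width, center + half_width

for heads, tosses in [(47, 100), (564, 1200)]:
    p1, p2 = bernoulli_interval(heads / tosses, tosses, 0.95)
    print(tosses, p1, p2, p2 - p1, p1 < 0.5 < p2)
```

For 100 tosses the interval contains 0.5; for 1200 tosses it does not, although the upper end point lies only slightly below 0.5.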








6.7 MAXIMUM LIKELIHOOD ESTIMATORS 


In the previous sections we furnished estimators for the mean, variance, and covariance of 
RVs. While these estimators enjoyed desirable properties, they seemed quite arbitrary in 
that they did not follow from any general principle. In this section, we discuss a somewhat 
general approach for finding estimators. This approach is called the maximum likelihood
(ML) principle and the estimators derived from it are called maximum likelihood estimators
(MLEs). The main drawback to the MLE approach is that the underlying form of the pdf 
of the observed data must be known. The idea behind the MLE approach is illustrated in 
the following example. 


Example 6.7-1
Consider a Bernoulli RV that has PMF $P_X(k) = p^k(1-p)^{1-k}$, where $P[X = 1] = p$, and
$P[X = 0] = 1 - p$. We would like to estimate the value of $p$ with an estimator, say, $\hat{p}$, that is
a function only of the observations on $X$. Suppose we make $n$ observations on $X$ and we call
these observations $X_1, X_2, \ldots, X_n$. Then $Y = \sum_{i=1}^{n} X_i$ is the number of times that a one
was observed in $n$ tries. For example, the experiment might consist of tossing a coin $n$ times
and counting the number of times it came up heads, that is, $\{X = 1\}$, when the probability
of a head is $p$. Suppose this number is $k_1$. The a priori probability of observing $k_1$ heads
is given by $P[Y = k_1; p] = \binom{n}{k_1}p^{k_1}(1-p)^{n-k_1}$. We explicitly show the dependence of the
result on $p$ because $p$ is assumed unknown. We now ask what value of $p$ was most likely to
have yielded this result? Since the term on the right is a continuous function of $p$, we can
obtain this result by a differentiation. Setting the derivative to zero yields

$$\frac{dP[Y = k_1; p]}{dp} = \binom{n}{k_1}p^{k_1-1}(1-p)^{n-k_1-1}\left[k_1(1-p) - p(n-k_1)\right] = 0.$$

Thus, there are three roots: $p = 0$, $p = 1$, and $p = k_1/n$. The first two roots yield a minimum
while $p = k_1/n$ yields a maximum. Thus, our estimate for the most likely value of $p$ in this
case is $k_1/n$. Had we performed the experiment a second time and observed $k_2$ heads, our
estimate for $p$ would have been $k_2/n$. These estimates are realizations of the MLE for $p$:

$$\hat{p} = \frac{Y}{n}.$$ (6.7-1)

In the previous example we used the fact that the distribution of $\sum_{i=1}^{n} X_i$ is binomial. Could
we have obtained the same result without this knowledge? After all, for some distributions
it might be quite a bit of work to compute the distribution of the sum of RVs. The answer
is yes and the result is based on generation of the likelihood function.


Definition 6.7-1 The likelihood function† $L(\theta)$ of the random variables $X_1,
X_2, \ldots, X_n$ is the joint pdf $f_{X_1X_2\cdots X_n}(x_1, x_2, \ldots, x_n; \theta)$ considered as a function of the
unknown parameter $\theta$. In particular if $X_1, X_2, \ldots, X_n$ are independent observations on an RV
$X$ with pdf $f_X(x; \theta)$, then the likelihood function for outcomes $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$
becomes

$$L(\theta) = \prod_{i=1}^{n} f_X(x_i; \theta)$$ (6.7-2)

since the $\{X_i\}$ are i.i.d. RVs with pdf $f_X(x; \theta)$. If, for a given outcome $\mathbf{x} = (x_1, x_2, \ldots, x_n)$,
$\theta^*(x_1, x_2, \ldots, x_n)$ is the value of $\theta$ that maximizes $L(\theta)$, then $\theta^*(x_1, x_2, \ldots, x_n)$ is the ML
estimate of $\theta$ (a number) and $\hat{\theta} = \theta^*(X_1, X_2, \ldots, X_n)$ is the MLE (an RV) for $\theta$. It is therefore
quite reasonable to define the likelihood function as the RV $L(\theta) \triangleq \prod_{i=1}^{n} f_X(X_i; \theta)$.
Then, maximizing with respect to $\theta$ yields the MLE $\hat{\theta}(X_1, \ldots, X_n)$ directly.

†Strictly speaking we should write $L(\theta; x_1, x_2, \ldots, x_n)$ or, as some books have, $L(\theta; X_1, X_2, \ldots, X_n)$.
However, we dispense with this excessive notation.


Example 6.7-2
We consider finding the ML estimate of $p$ in Example 6.7-1 using the likelihood function.
If we make $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$ on a Bernoulli RV $X$, the likelihood function
(with $\theta = p$) becomes $L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^{n} x_i} \times (1-p)^{n - \sum_{i=1}^{n} x_i}$. By setting $dL(p)/dp = 0$,
we obtain three roots: $p = 0$, $p = 1$, and $p = \sum_{i=1}^{n} x_i/n$. The first two roots yield a
minimum, while the last root yields a maximum. Thus, $p^*(\mathbf{x}) = \sum_{i=1}^{n} x_i/n$ and the MLE
of $p$ is $\hat{p} = p^*(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i/n$.

In many cases the differentiation is more conveniently done on the logarithm of the likelihood
function. The log-likelihood function is $\log L(\theta)$ (usually the natural log is used) and has
its maximum at the same value of $\theta$ as that of $L(\theta)$. Another point is that the MLE
cannot always be found by differentiation, in which case we have to use other methods.
Finally, multiple-parameter ML estimation can be done by solving simultaneous equations.
We illustrate all three points in the next three examples, respectively.


Example 6.7-3 
Assume $X:N(\mu, \sigma^2)$, where $\sigma$ is known. Compute the MLE of the mean $\mu$.

Solution  The likelihood function for $n$ realizations of $X$ is

$$L(\mu) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right).$$ (6.7-3)

Since the log function is monotonic, the maximum of $L(\mu)$ is also that of $\log L(\mu)$. Hence we write

$$\log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

and set

$$\frac{\partial \log L(\mu)}{\partial \mu} = 0.$$

This yields

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0.$$

Thus, the value of $\mu$, say $\mu^*$, that maximizes $L(\mu)$ is

$$\mu^* = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

which implies that the MLE of $\mu$ should be

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$ (6.7-4)

Thus, we see that in the Normal case, the MLE of $\mu$ can be computed by differentiating
the log-likelihood function and that it turns out to be the sample mean.


Example 6.7-4 
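Assume $X$ is uniform in $(0, \theta)$, that is,

$$f_X(x) = \begin{cases} 1/\theta, & 0 < x < \theta, \\ 0, & \text{otherwise,} \end{cases}$$

and we wish to compute the MLE for $\theta$. Let a particular realization of the $n$ observations
$X_1, \ldots, X_n$ be $\mathbf{x} = (x_1, \ldots, x_n)^T$ and let $x_m \triangleq \max(x_1, \ldots, x_n)$. The likelihood function is

$$L(\theta) = \begin{cases} 1/\theta^n, & x_m \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$

Clearly to maximize $L$ we must make the estimate $\theta'$ as small as possible. But $\theta'$ cannot be
smaller than $x_m$. Hence $\theta'$ is $x_m$ and the MLE is

$$\hat{\theta} = \max(X_1, \ldots, X_n).$$ (6.7-5)

The CDF of $\hat{\theta}$ for $n = 2$ is

$$F_{\hat{\theta}}(\alpha) = F_{X_1}(\alpha)F_{X_2}(\alpha) = F_X^2(\alpha).$$ (6.7-6)

We leave the computation of the CDF and pdf of $\hat{\theta}$ for arbitrary $n$ as an exercise for the
reader.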


Example 6.7-5 
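Consider the Normal pdf

$$f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right), \quad -\infty < x < \infty.$$

The log-likelihood function, for $n$ realizations, is

$$\tilde{L}(\mu, \sigma) \triangleq \log L = -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$ (6.7-7)

Now set

$$\frac{\partial \tilde{L}}{\partial \mu} = 0, \qquad \frac{\partial \tilde{L}}{\partial \sigma} = 0$$

and obtain the simultaneous equations

$$\sum_{i=1}^{n}(x_i - \mu) = 0$$ (6.7-8)

$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2 = 0.$$ (6.7-9)

From Equation 6.7-8 we infer that

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$ (6.7-10)

From Equation 6.7-9 we infer that, using the result from Equation 6.7-10,

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2.$$ (6.7-11)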
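The two ML estimates in Equations 6.7-10 and 6.7-11 are simple to compute, and comparing the second with the $1/(n-1)$ version shows the bias factor $(n-1)/n$. This is a minimal sketch; the function name and the eight data values are hypothetical and chosen so the sums can be checked by hand.

```python
def normal_ml_estimates(xs):
    """ML estimates of the mean and variance of a Normal RV (divide by n, not n - 1)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical realizations
mu_hat, var_ml = normal_ml_estimates(xs)
var_unbiased = var_ml * len(xs) / (len(xs) - 1)  # the 1/(n-1) version for comparison
print(mu_hat, var_ml, var_unbiased)
```

For this data the squared deviations sum to 32, so the ML variance estimate is 32/8 while the $1/(n-1)$ estimate is 32/7.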





MLEs have a number of desirable properties including squared-error consistency and invariance.
Invariance is that property that says that if $\hat{\theta}$ is the MLE for $\theta$, then $h(\hat{\theta})$ is the MLE
for $h(\theta)$. However, as seen in Example 6.7-5 (Equation 6.7-11), ML estimators cannot be
counted on to be unbiased. We complete this section with an example that illustrates the
invariance property.


Example 6.7-6 
Consider $n$ observations on a Normal RV. Assume that it is known that the mean is zero.
The MLE of the variance is $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2$. The standard deviation $\sigma$ is the square root
of the variance. Hence the MLE of the standard deviation is the square root of the MLE
for the variance, that is, $\hat{\sigma} = \left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\right)^{1/2}$.








6.8 ORDERING, MORE ON PERCENTILES, PARAMETRIC VERSUS 
NONPARAMETRIC STATISTICS 


We make $n$ i.i.d. observations on a generic RV $X$ (recall that $X$ is sometimes called a
population) with CDF $F_X(x)$ to obtain the sample $X_1, X_2, \ldots, X_n$. The joint pdf of the
sample is $f_X(x_1) \times \cdots \times f_X(x_n)$, $-\infty < x_i < \infty$, $i = 1, \ldots, n$. Next we order the $X_i$, $i =
1, \ldots, n$, by size (signed magnitude) to obtain the ordered sample $Y_1, Y_2, \ldots, Y_n$ such that
$-\infty < Y_1 < Y_2 < \cdots < Y_n < \infty$. When ordered, the sequence 3, −2, −9, 4 would become
−9, −2, 3, 4. If a sequence $X_1, \ldots, X_{20}$ was generated from 20 observations on $X : N(0,1)$,
it would be very unlikely that $Y_1 > 0$ because this would require that the other 19 $Y_i$, $i =
2, \ldots, 20$, be greater than zero and therefore all the samples would be on the positive side of
the Normal curve. The probability of this event is $(1/2)^{20}$. Likewise it would be extremely
unlikely that $Y_{20} < 0$ because this would require that the other 19 $Y_i$, $i = 1, \ldots, 19$, be less
than zero. As shown in Section 5.3, the joint pdf of the ordered sample $Y_1, Y_2, \ldots, Y_n$ is
$n!\,f_X(y_1) \times \cdots \times f_X(y_n)$, $-\infty < y_1 < y_2 < \cdots < y_n < \infty$ and zero else. We distinguish
between ordering and ranking in that ranking normally assigns a value to the ordered 
elements. For example, most people would order the pain of a broken bone higher than that 
of a sore throat due to a cold. But if a physician asked the patient to rank these pains on a 
scale of zero to ten, the pain associated with the broken bone might be ranked at eight or 
nine while the sore throat might be given a rank of three or four. 

Consider next the idea of percentiles. We have already used this concept in numerous 
places in earlier discussions; here we elaborate. Assume that the IQ of a large segment of the 
population is distributed as N(100, 100), that is, a mean of 100 and a standard deviation of 
10. Obviously the Normal approximation is valid only over a limited range because no one 
has an IQ of 1000 or an IQ of -10. The IQ test itself is valid only over a limited range and 
may not give an accurate score for people that are extremely bright or severely cognitively 
handicapped. It is sometimes said that people in either group are “off the IQ scale.” Still 
the IQ test is widely used as an indicator of problem-solving ability. Suppose that the 
result of an IQ test says that the child ranks in the 93rd percentile of the examinees and 
therefore qualifies for admission to selective schools. How do we locate the 93rd percentile 
in a population of n students? 

Definition (percentile): Given an RV $X$ with CDF $F_X(x)$, the $u$-percentile of $X$ is
the number $x_u$ such that $F_X(x_u) = u$. If the function $F_X$ is everywhere continuous with
continuous derivative, then $x_u = F_X^{-1}(u)$, where $F_X^{-1}$ is the inverse function associated
with $F_X$, that is, $F_X^{-1}(F_X(x_u)) = x_u$. A CDF and its inverse function are shown in Figure
6.8-1. In keeping with common usage, we use $x_u$ or $100 \times x_u$ interchangeably to mean
$x_u$-percentile.

Figure 6.8-1 (a) $u$ versus $x_u$, that is, $u = F_X(x_u)$; (b) the inverse function $x_u$ versus $u$, that is, $x_u = F_X^{-1}(u)$.

Observation. In the special case where $X:N(0,1)$ with CDF $F_{SN}(z)$, we use the symbol
$z_u$ (or $100 \times z_u$) to denote the $u$-percentile of $X$. If $X:N(\mu, \sigma^2)$ then the $u$-percentile of $X$,
$x_u$, is related to $z_u$ according to

$$x_u = \mu + z_u\sigma.$$ (6.8-1)


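Example 6.8-1
(relation between $x_u$ and $z_u$) We wish to show that $x_u = \mu + z_u\sigma$ if $X:N(\mu, \sigma^2)$. We proceed
as follows:

We write

$$u = F_X(x_u) = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{x_u}\exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(x_u - \mu)/\sigma}\exp\left[-\frac{1}{2}z^2\right]dz.$$

The last line is $F_{SN}(z_u)$ with $z_u = (x_u - \mu)/\sigma$, where $F_{SN}$ is the CDF of the standard Normal RV. Hence $x_u = \mu + z_u\sigma$.
We can use this result in the previously mentioned IQ problem. From the data we have
$F_X(x_u) = 0.93 = F_{SN}(z_u)$. We can find $z_u$ from tables of the Normal CDF, or from
tables of the error function ($\mathrm{erf}(z_u) = F_{SN}(z_u) - 0.5$); we get that $z_u \approx 1.48$. Then with
$x_u = \mu + z_u\sigma = 100 + 1.48\,(10)$, we get that the 93rd percentile in the IQ is 115.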
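The IQ calculation can be reproduced with the standard library instead of a table. This is a small sketch assuming `statistics.NormalDist`; the function name `normal_percentile` is illustrative.

```python
from statistics import NormalDist

def normal_percentile(u, mu, sigma):
    """u-percentile of N(mu, sigma^2) via Equation 6.8-1: x_u = mu + z_u * sigma."""
    z_u = NormalDist().inv_cdf(u)   # u-percentile of the standard Normal RV
    return mu + z_u * sigma

x_93 = normal_percentile(0.93, mu=100.0, sigma=10.0)
print(x_93)  # rounds to the value 115 quoted in the example
```

The same number is returned by `NormalDist(100.0, 10.0).inv_cdf(0.93)`, which applies the shift and scale internally.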





The Median of a Population Versus Its Mean 


The median of the population $X$ is the point $x_{0.5}$ such that $F_X(x_{0.5}) = 0.5$.† This is to be
contrasted with the mean of $X$, written as $\mu_X$, and defined as $\mu_X = \int_{-\infty}^{\infty} x f_X(x)\,dx$. The
median and mean do not necessarily coincide. For example, in the case of the exponential law
where $f_X(x) = \lambda e^{-\lambda x}u(x)$, we find that $\mu_X = 1/\lambda$ but $x_{0.5} = 0.69/\lambda$. To compute the mean
of $X$ we need $f_X(x)$, which is often not known. The mean may seem like a rather abstract
parameter while the median is merely the point $x_{0.5}$ where $P[X \le x_{0.5}] = 0.5$. However, given $n$
i.i.d. observations $X_1, X_2, \ldots, X_n$ on $X$, we estimate $\mu_X$ with the mean estimator function
(MEF) $\hat{\mu}_X = n^{-1}\sum_{i=1}^{n} X_i$, which happens to be an unbiased and consistent estimator for
the mean of many populations. Indeed it is the simple form of the MEF $\hat{\mu}_X$ and the fact
that, if $\sigma_X^2$ is finite, $\hat{\mu}_X \to \mu_X$ for large $n$ (see the law of large numbers) that make the
mean so useful in many applications. Realizations of the MEF are intuitively appealing as
they give us a sense of the center of gravity of the data.

†When the event $\{X = x_{0.5}\}$ has zero probability, the events $\{X < x_{0.5}\}$ and $\{X > x_{0.5}\}$ are equally
probable at 0.5. This gives rise to the often-heard statement that the median “is the point at which half the
population is below and half above.” But as the median is the 50th percentile, it includes the probability
of the event $\{X = x_{0.5}\}$ and the statement should be modified to “the median is the point at which half
the population is at or below.” The median is a parameter that characterizes the whole population. The
median of a random sample is only an estimate of the true median.

Example 6.8-2
(median salary versus mean salary) Consider a country where half the workers make $10,000 
per year or less and half make more. Then we can take $10,000 as the median annual income. 
Now suppose that among those making $10,000 or less per annum, the numerical-mean 
annual income is $8000 while for those making more than $10,000 per annum, the numerical- 
mean annual income is $100,000. The numerical mean income for the country as a whole in 
this case is $54,000. In your judgment, which of these figures describes the economy of the 
country better? Which of these figures would you use to put the country in a good (bad) 
light? 


Example 6.8-3 
(median and mean are not the same for the binomial) We make the somewhat trivial observation
that in the binomial case the mean and median do not coincide. For example, with
$n = 5$, the mean is 2.5 but the median, such as it is, is 2. However, when $n$ is large, the
median and mean approach each other and the median can be estimated by the mean.
Indeed, stated without proof, the difference between the mean and median is proportional
to $(p(1-p))^n$, which becomes arbitrarily small for $n \to \infty$.





Parametric versus Nonparametric Statistics 


The situation where we know or assume a functional form for a density, distribution, or
probability mass function and use this information in computing probabilities, estimating
parameters, and making decisions is called parametric statistics. Typically, in the parametric
case, we might assume a form for the population density, for example, the Normal,
and wish to estimate some unknown parameter of the distribution, for example, the mean
$\mu_X$. In Chapter 7 we make extensive use of parametric statistics in hypothesis testing.
Much of parametric statistics is based on the Central Limit Theorem, which states that the
distribution of the sum of a large number of i.i.d. observations tends to the Normal CDF.
The estimation of the properties and parameters of a population without any assumptions
on the form or knowledge of the population distribution is known as distribution-free
or nonparametric statistics. Statistics based only on observations without assuming underlying
distributions are sometimes said to be robust in the sense that the theorems and
conclusions drawn from the observations do not change with the form of the underlying
distributions. Whereas the mean and standard deviation are useful in characterizing the
center and dispersion of a population in the parametric case, the median and range play a
comparable role in the nonparametric case. To estimate the median from $X_1, X_2, \ldots, X_n$, we
order them by magnitude as $Y_1 < Y_2 < \cdots < Y_n$ and estimate $x_{0.5}$ with the sample median
estimator

$$\hat{Y}_{0.5} = \begin{cases} Y_{k+1} & \text{if } n \text{ is odd, that is, } n = 2k + 1, \\ 0.5\,(Y_k + Y_{k+1}) & \text{if } n \text{ is even, that is, } n = 2k. \end{cases}$$ (6.8-2)

The sample median is not an unbiased estimator for $x_{0.5}$ but becomes nearly so when $n$
is large. The dispersion in the nonparametric case is measured from the 50 percent range,
that is, $\Delta x_{0.50} \triangleq x_{0.75} - x_{0.25}$, or the 90 percent range, that is, $\Delta x_{0.90} \triangleq x_{0.95} - x_{0.05}$, or
some other appropriate range. These have to be estimated from the observations.

Figure 6.8-2 (vertical axis: index value, 1 to 10; horizontal axis: ordered samples $y_1, \ldots, y_{10}$) Estimated percentile range from ten ordered samples showing linear interpolation between
the samples. To get the estimated percentile, take the index value and multiply by 100/11. Thus, to a
first approximation, the 90th percentile is estimated from $y_{10}$ while the 9th percentile is estimated from
$y_1$. An approximate 50 percent range is covered by $y_8 - y_3$.


Example 6.8-4 
(interpolation to get percentile points) Using the symbol $\alpha \sim \beta$ to mean $\alpha$ estimates $\beta$, we
have $Y_3 \sim x_{0.273}$, $Y_4 \sim x_{0.364}$ and using linear interpolation

$$Y_4 + \frac{(Y_4 - Y_3)(0.3 - 4/11)}{1/11} \sim x_{0.3}.$$

Interpolation between samples is shown in Figure 6.8-2.
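The sample median and the interpolation rule can both be written in a few lines. The sketch below is illustrative: the function names are invented here, and the interpolation assumes the convention used above, in which the $k$th ordered sample among $n$ estimates the $k/(n+1)$ percentile point.

```python
def sample_median(xs):
    """Realization of the sample median estimator (Equation 6.8-2)."""
    ys = sorted(xs)
    n = len(ys)
    k = n // 2
    return ys[k] if n % 2 else 0.5 * (ys[k - 1] + ys[k])

def interpolated_percentile(xs, u):
    """Estimate x_u by linear interpolation between adjacent ordered samples.

    Assumes the k-th ordered sample (k = 1..n) estimates the k/(n+1) percentile.
    """
    ys = sorted(xs)
    n = len(ys)
    position = u * (n + 1)            # fractional 1-based index
    if not 1 <= position <= n:
        raise ValueError("u lies outside the range covered by the samples")
    k = int(position)                 # ordered sample just below the position
    if k == n:
        return ys[-1]
    return ys[k - 1] + (position - k) * (ys[k] - ys[k - 1])

ten = [float(i) for i in range(1, 11)]          # hypothetical ordered realizations 1..10
print(sample_median([3, -2, -9, 4]), sample_median(ten))
print(interpolated_percentile(ten, 0.3))        # lies between y3 and y4
```

For $n = 10$ and $u = 0.3$ this reduces to $y_3 + 0.3\,(y_4 - y_3)$, which is the same point as the expression in Example 6.8-4.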





Confidence Interval on the Percentile 


We discuss next a fundamental result connecting order statistics with percentiles. Once
again the model is that of collecting a sample of $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$ on an
RV $X$ with CDF $F_X(x)$. We recall the notation $P[X_i \le x_u] \triangleq u$. Next we order the samples
by signed magnitude to get $Y_1 < Y_2 < \cdots < Y_n$. To remind the reader: if a set of realizations
of the $X_i$, $i = 1, \ldots, 5$, are $x_1 = 7$, $x_2 = -2$, $x_3 = 7.2$, $x_4 = 1$, $x_5 = 3$ then the associated
realizations on the $Y_i$, $i = 1, \ldots, 5$, are $y_1 = -2$, $y_2 = 1$, $y_3 = 3$, $y_4 = 7$, $y_5 = 7.2$. From the
subscripts on $\{Y_i\}$ we can make an obvious but remarkable statement on the $\{X_i\}$, namely
that the event $\{Y_k < x_u\}$ implies that there are at least $k$ of the $\{X_i\}$ that are less than
$x_u$; there may be more but certainly not fewer. Then, because the $\{X_i\}$ are i.i.d., we can use
the binomial probability formula to compute $P[Y_k < x_u]$ as

$$P[Y_k < x_u] = P[\text{at least } k \text{ of the } \{X_i\} \text{ are less than } x_u] = \sum_{i=k}^{n}\binom{n}{i}u^i(1-u)^{n-i}.$$ (6.8-3)

Next consider the event $\{Y_{k+r} > x_u\}$. Since $Y_{k+r}$ is the $(k+r)$th element in the ordering of
the $\{X_i\}$, there are at least $n - (k+r) + 1$ of the $\{X_i\}$ that are greater than $x_u$. Equivalently
there can be no more than $k + r - 1$ of the $\{X_i\}$ less than $x_u$. Then

$$P[Y_{k+r} > x_u] = P[\text{no more than } k+r-1 \text{ of the } \{X_i\} \text{ are less than } x_u] = \sum_{i=0}^{k+r-1}\binom{n}{i}u^i(1-u)^{n-i}.$$ (6.8-4)

The intersection of the events $\{Y_{k+r} > x_u\}$ and $\{Y_k < x_u\}$ is the event $\{Y_k < x_u < Y_{k+r}\}$.
Its probability is

$$P[Y_k < x_u < Y_{k+r}] = \sum_{i=k}^{k+r-1}\binom{n}{i}u^i(1-u)^{n-i}$$ (6.8-5)

and is independent of $f_X(x)$. The result given in Equation 6.8-5 is one of the major results
of nonparametric statistics and has important applications as we illustrate below.


Example 6.8-5
(sample size needed to cover the median at 95 percent confidence) We seek the end points
$Y_1$, $Y_n$ of a random interval $[Y_1, Y_n]$ so that the event $\{Y_1 < x_{0.5} < Y_n\}$ occurs with probability
0.95. Here $Y_1 \triangleq \min(X_1, X_2, \ldots, X_n)$, $Y_n \triangleq \max(X_1, X_2, \ldots, X_n)$. In effect, how large
should $n$ be?
The answer is furnished by computing

$$P[Y_1 < x_{0.5} < Y_n] = \sum_{i=1}^{n-1}\binom{n}{i}(1/2)^n \approx 0.95$$

and finding that for $n = 5$, $P[Y_1 < x_{0.5} < Y_5] = 0.94$, while $n = 6$ gives 0.97. The probability that the random interval
$[Y_1, Y_n]$ covers the 50 percent percentile point is shown in Figure 6.8-3 for various values of $n$.

Figure 6.8-3 Probability that the random interval $[Y_1, Y_n]$ covers the median, that is, the probability of the event $\{Y_1 < x_{0.5} < Y_n\}$, for various values of $n$ (horizontal axis: sample size; vertical axis: probability that the median is covered).

Figure 6.8-4 Probability that the 33rd percentile point is covered by the $k$th adjacent ordered pair (horizontal axis: $k$th ordered pair, $k = 1, \ldots, 9$). Among the pairwise intervals $[Y_k, Y_{k+1}]$, the interval $[Y_3, Y_4]$ is most likely to cover
$x_{0.33}$.


Example 6.8-6
(between which pair of ordered samples does $x_{0.33}$ lie?) We have a set of ordered samples
$\{Y_1, Y_2, \ldots, Y_n\}$ and wish to find the pair $\{Y_i, Y_{i+1}\}$, $i = 1, \ldots, n-1$, that maximizes the
probability of covering the 33.33rd percentile point. The 33.33rd percentile point is defined
by $u = 1/3 = F_X(x_{0.33})$. For specificity we assume n = 10. From Equation 6.8-5 we compute

\[ P[Y_k < x_{0.33} < Y_{k+1}] = \frac{10!}{k!\,(10-k)!} (1/3)^k (2/3)^{10-k}, \quad k = 1, \ldots, 9 \]

and plot the result in Figure 6.8-4. Clearly the interval $[Y_3, Y_4]$ is most likely to cover $x_{0.33}$.
The probability of the event $\{Y_3 < x_{0.33} < Y_4\}$ is 0.26.
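
The search over k in this example is easy to reproduce. The sketch below assumes only the formula of the example; the function name `most_likely_pair` is a hypothetical helper.

```python
from math import comb

def most_likely_pair(n, u):
    """Return (k, probability) for the adjacent pair [Y_k, Y_{k+1}] most likely to cover x_u."""
    probs = {k: comb(n, k) * u**k * (1 - u)**(n - k) for k in range(1, n)}
    best_k = max(probs, key=probs.get)
    return best_k, probs[best_k]

k, p = most_likely_pair(10, 1 / 3)
print(k, round(p, 2))  # 3 0.26
```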





Confidence Interval for the Median When n Is Large 


If n is large enough so that the Normal approximation to the binomial is valid in distribution,
we can use

\[ P[\alpha \le S_n \le \beta] \approx \frac{1}{\sqrt{2\pi}} \int_{\alpha_n}^{\beta_n} \exp\left[-\tfrac{1}{2}y^2\right] dy, \tag{6.8-6a} \]

where

\[ P[\alpha \le S_n \le \beta] = \sum_{i=\alpha}^{\beta} \binom{n}{i} p^i (1-p)^{n-i}, \]

\[ \alpha_n \triangleq \frac{\alpha - np - 0.5}{\sqrt{np(1-p)}}, \quad \text{and} \quad \beta_n \triangleq \frac{\beta - np + 0.5}{\sqrt{np(1-p)}}. \tag{6.8-6b} \]

To apply these results to the problem at hand, we write

\[ P[Y_r < x_{0.5} < Y_{n-r+1}] = \sum_{i=r}^{n-r} \binom{n}{i} (1/2)^n, \tag{6.8-7} \]

where we used that, by definition of the median, $u = F_X(x_{0.5}) = 1/2$. The choice of
subscripts will ensure that the confidence interval will begin at the rth place counting from
the bottom, that is, from one, and end at the place reached by counting r observations back
from the top. For example, if the 95 percent confidence calculation for n = 10 yields r = 3,
the confidence interval begins at the third observation and ends at the eighth observation,
both points reached by counting three places from bottom and top, respectively, that is,
1, 2, 3 ($Y_3$) and 10, 9, 8 ($Y_8$), and the result would appear as $P[Y_3 < x_{0.5} < Y_8] = 0.95$.

In the binomial sum in Equation 6.8-7 we note that its mean is $n/2$ and its standard
deviation is $\sqrt{n}/2$. Hence the Normal approximation to the binomial sum in Equation 6.8-7
for a 95 percent confidence interval is

\[ \sum_{i=r}^{n-r} \binom{n}{i} (1/2)^n \approx \frac{1}{\sqrt{2\pi}} \int_{\alpha_n}^{\beta_n} \exp\left[-\tfrac{1}{2}z^2\right] dz = 0.95, \]

which, from the tables of the standard Normal distribution (or the error function), yields
$\alpha_n = -1.96$, $\beta_n = 1.96$. Then it follows from Equation 6.8-6b that

\[ 1.96 = \frac{n - r - n/2 + 0.5}{\sqrt{n}/2}, \qquad -1.96 = \frac{r - n/2 - 0.5}{\sqrt{n}/2}, \]

which yields $r = (n/2) - 1.96\sqrt{n}/2 + 0.5$. If r is not an integer replace r by $\lfloor r \rfloor$, where the
latter is the largest integer less than or equal to r.


Example 6.8-7
(95 percent confidence interval for the median for n = 20) We make 20 observations on an
RV X and label these $\{X_i, i = 1, \ldots, 20\}$. We order them by signed magnitude so that
$Y_1 < Y_2 < \cdots < Y_n$. We use $r = (n/2) - 1.96\sqrt{n}/2 + 0.5$ to obtain r = 6.12 and $\lfloor r \rfloor = 6$.
Then $P[Y_6 < x_{0.5} < Y_{15}] \ge 0.95$.
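
The rank r from the Normal approximation can be checked against the exact binomial sum of Equation 6.8-7. This is a minimal sketch; the function names are assumptions, and z = 1.96 is the 95 percent value used in the text.

```python
from math import comb, floor, sqrt

def median_ci_rank(n, z=1.96):
    """r = floor(n/2 - z*sqrt(n)/2 + 0.5), the rank counted in from each end."""
    return floor(n / 2 - z * sqrt(n) / 2 + 0.5)

def exact_coverage(n, r):
    """Exact P[Y_r < x_0.5 < Y_{n-r+1}] from the binomial sum in Equation 6.8-7."""
    return sum(comb(n, i) for i in range(r, n - r + 1)) / 2**n

r = median_ci_rank(20)
print(r, 20 - r + 1)                    # 6 15
print(round(exact_coverage(20, r), 4))  # 0.9586
```

For n = 20 the exact coverage of $[Y_6, Y_{15}]$ is therefore slightly above the nominal 0.95, in agreement with the inequality stated in the example.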





6.9 ESTIMATION OF VECTOR MEANS AND COVARIANCE MATRICES†

Let $\mathbf{X} \triangleq (X_1, \ldots, X_p)^T$ be a p-component random vector with pdf $f_{\mathbf{X}}(\mathbf{x})$. Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be
n observations on $\mathbf{X}$, that is, the $\mathbf{X}_i$, $i = 1, \ldots, n$ are drawn from $f_{\mathbf{X}}(\mathbf{x})$. Then $\mathbf{X}_i$, $i = 1, \ldots, n$
are i.i.d. random vectors with pdf $f_{\mathbf{X}}(\mathbf{x}_i)$. We show below how to estimate

(i) $\boldsymbol{\mu}_X \triangleq E[\mathbf{X}] = (\mu_1, \ldots, \mu_p)^T$, where $\mu_j \triangleq E[X_j]$, $j = 1, \ldots, p$,

and

(ii) $\mathbf{K}_{XX} \triangleq E[(\mathbf{X} - \boldsymbol{\mu}_X)(\mathbf{X} - \boldsymbol{\mu}_X)^T]$.

The vector and matrix parameters $\boldsymbol{\mu}_X$ and $\mathbf{K}_{XX}$ are useful in many signal processing
applications. They also figure prominently in characterizing the multi-dimensional Normal
distribution [6-5]. The covariance matrix $\mathbf{K}_{XX}$ is most often a full-rank, positive-definite,
real-symmetric matrix. The properties of such matrices are well-known [6-6] and can be
exploited in their estimation.

†This section and the next one can be omitted on a first reading.


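Estimation of $\boldsymbol{\mu}$


Consider the p-vector estimator $\hat{\boldsymbol{\Theta}}$ given by

\[ \hat{\boldsymbol{\Theta}} \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathbf{X}_i. \tag{6.9-1} \]

We shall show that $\hat{\boldsymbol{\Theta}}$ is unbiased and consistent for $\boldsymbol{\mu}$. We arrange the observations as in
Table 6.9-1.

In Table 6.9-1 $X_{ij}$ is the jth component of the random vector $\mathbf{X}_i$. The components of the
vector $\mathbf{Y}_j$, $j = 1, \ldots, p$ are n i.i.d. observations on the jth component of the random vector
$\mathbf{X}$. From the scalar case we already know that

\[ \hat{\Theta}_j \triangleq \frac{1}{n} \sum_{i=1}^{n} X_{ij} = \hat{\mu}_j, \quad j = 1, \ldots, p \tag{6.9-2} \]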


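is unbiased and consistent for $\mu_j \triangleq E[X_{ij}]$, $i = 1, \ldots, n$. It follows therefore that the vector
estimator $\hat{\boldsymbol{\Theta}} \triangleq (\hat{\Theta}_1, \ldots, \hat{\Theta}_p)^T$ is unbiased and consistent for $\boldsymbol{\mu}$. The vector $\mathbf{Y}_j$ contains all
the information for estimating $\mu_j$. Thus, $E[\mathbf{Y}_j] = \mu_j \mathbf{i}$, where $\mathbf{i} \triangleq (1, 1, \ldots, 1, 1)^T$.

When $\mathbf{X}$ is normal, $\hat{\boldsymbol{\Theta}}$ is normal. Even when $\mathbf{X}$ is not normal, $\hat{\boldsymbol{\Theta}}$ tends to the normal
for large n by the central limit theorem (Theorem 4.7-1).

Table 6.9-1 Observed Data (n columns, one per observation vector)

         X_1     X_2     ...    X_n
Y_1:     X_11    X_21    ...    X_n1
Y_2:     X_12    X_22    ...    X_n2
...
Y_p:     X_1p    X_2p    ...    X_np

The components of $\mathbf{Y}_j$ are all that is necessary for estimating the jth component, $\mu_j$, of the vector $\boldsymbol{\mu}$.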
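
The mean-vector estimator of Equation 6.9-1 is simply a component-by-component average. The sketch below uses plain Python lists and made-up data; the function name is an assumption for illustration.

```python
def sample_mean_vector(observations):
    """Average n observation vectors component by component (Equation 6.9-1)."""
    n = len(observations)
    p = len(observations[0])
    return [sum(x[j] for x in observations) / n for j in range(p)]

# Three made-up observations of a 2-component vector.
data = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
print(sample_mean_vector(data))  # [3.0, 4.0]
```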


Estimation of the covariance K 


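If the mean $\boldsymbol{\mu}$ is known, then the estimator

\[ \hat{\boldsymbol{\Theta}} \triangleq \frac{1}{n} \sum_{i=1}^{n} (\mathbf{X}_i - \boldsymbol{\mu})(\mathbf{X}_i - \boldsymbol{\mu})^T \tag{6.9-3} \]

is unbiased for $\mathbf{K}$. However, since the mean is generally estimated from the sample mean $\hat{\boldsymbol{\mu}}$,
it turns out that the estimator

\[ \hat{\boldsymbol{\Theta}} \triangleq \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{X}_i - \hat{\boldsymbol{\mu}})(\mathbf{X}_i - \hat{\boldsymbol{\mu}})^T \tag{6.9-4} \]

is unbiased for $\mathbf{K}_{XX}$. To prove this result requires some effort. First observe that the diagonal
elements of $\hat{\boldsymbol{\Theta}}$ are of the form

\[ \hat{\Theta}_{jj} \triangleq \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \hat{\mu}_j)^2 \tag{6.9-5} \]

which we already know from the univariate case are unbiased for $\sigma_j^2 \triangleq E[(X_j - \mu_j)^2]$. Next
consider the sequence ($l \ne m$)

\[ X_{1l} + X_{1m},\ X_{2l} + X_{2m},\ \ldots,\ X_{nl} + X_{nm}, \tag{6.9-6} \]

which are n i.i.d. observations $Z_{lm}^{(i)}$, $i = 1, \ldots, n$, on a univariate RV $Z_{lm} \triangleq X_l + X_m$ with mean $\mu_l + \mu_m$
and variance

\[ \mathrm{Var}[Z_{lm}] = E[\{(X_l - \mu_l) + (X_m - \mu_m)\}^2] = \sigma_l^2 + \sigma_m^2 + 2K_{lm} \tag{6.9-7} \]

where $K_{lm} \triangleq E[(X_l - \mu_l)(X_m - \mu_m)]$ is the lmth element of $\mathbf{K}_{XX}$. Finally, consider

\[ \hat{\Theta}_{Z_{lm}} \triangleq \frac{1}{n-1} \sum_{i=1}^{n} \left[Z_{lm}^{(i)} - (\hat{\mu}_l + \hat{\mu}_m)\right]^2, \tag{6.9-8} \]

which, by Equation 6.8-15, is unbiased for $\sigma_l^2 + \sigma_m^2 + 2K_{lm}$. If we expand Equation 6.9-8
and use the fact that $Z_{lm}^{(i)} \triangleq X_{il} + X_{im}$, we obtain

\[ \hat{\Theta}_{Z_{lm}} = \frac{1}{n-1} \sum_{i=1}^{n} \left[(X_{il} - \hat{\mu}_l) + (X_{im} - \hat{\mu}_m)\right]^2 \]
\[ = \frac{1}{n-1} \sum_{i=1}^{n} (X_{il} - \hat{\mu}_l)^2 + \frac{1}{n-1} \sum_{i=1}^{n} (X_{im} - \hat{\mu}_m)^2 + \frac{2}{n-1} \sum_{i=1}^{n} (X_{il} - \hat{\mu}_l)(X_{im} - \hat{\mu}_m). \tag{6.9-9} \]

In Equation 6.9-9, the first term is unbiased for $\sigma_l^2$, the second is unbiased for $\sigma_m^2$, and the
sum of all three is unbiased by Equation 6.9-8 for $\sigma_l^2 + \sigma_m^2 + 2K_{lm}$. We therefore conclude
that

\[ \hat{\Theta}_{lm} \triangleq \frac{1}{n-1} \sum_{i=1}^{n} (X_{il} - \hat{\mu}_l)(X_{im} - \hat{\mu}_m) \tag{6.9-10} \]

is unbiased for $K_{lm}$ ($= K_{ml}$). Hence every term of $\hat{\boldsymbol{\Theta}}$ in Equation 6.9-4 is unbiased for every
corresponding term in $\mathbf{K}_{XX}$. In this sense $\hat{\boldsymbol{\Theta}} \triangleq \hat{\mathbf{K}}_{XX}$ is unbiased for $\mathbf{K}_{XX}$.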
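
To make the element-by-element structure of Equation 6.9-4 concrete, here is a minimal pure-Python sketch. The function name and the data are illustrative assumptions; in practice a numerical library would normally be used.

```python
def sample_covariance(observations):
    """Unbiased covariance estimate of Equation 6.9-4 (divides by n - 1)."""
    n = len(observations)
    p = len(observations[0])
    mean = [sum(x[j] for x in observations) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for x in observations:
        d = [x[j] - mean[j] for j in range(p)]  # deviation from the sample mean
        for l in range(p):
            for m in range(p):
                cov[l][m] += d[l] * d[m] / (n - 1)
    return cov

data = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
print(sample_covariance(data))  # [[4.0, 2.0], [2.0, 4.0]]
```

The diagonal entries are the univariate sample variances of Equation 6.9-5, and the off-diagonal entries are the cross terms of Equation 6.9-10; the result is symmetric by construction.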

By resorting again to the univariate case and assuming that all moments up to the
fourth order exist, we can show consistency for every term in the estimator for $\mathbf{K}_{XX}$, that
is, Equation 6.9-4. Hence, without specifying the distribution, Equations 6.9-1 and 6.9-4 are
unbiased and consistent estimators for $\boldsymbol{\mu}_X$ and $\mathbf{K}_{XX}$, respectively.

When $\mathbf{X}$ is normal, $\hat{\mathbf{K}}_{XX}$ obeys a structurally complex probability law called the
Wishart distribution (see [6-6], p. 126). More generally, when the form of the pdf of $\mathbf{X}$
is known, one can use the maximum likelihood method of estimating such parameters
as $\sigma_X^2$, $\boldsymbol{\mu}_X$ and $\mathbf{K}_{XX}$. Maximum likelihood estimators have several, but not all, desirable
properties as estimators. The next example shows that the MLE for the mean is not a
minimum mean-square error (MMSE) estimator.


Example 6.9-1
([6-5], p. 21.) Consider the sample mean estimator from Equation 6.8-3, that is,

\[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i. \]

We recognize that this estimator is the MLE for the mean $\mu$. Now we ask: what constant
a in the scalar estimator $\hat{\Theta} \triangleq a\hat{\mu}$ will generate the MMSE estimator of $\mu$? Recall the
$X_i$, $i = 1, \ldots, n$ are i.i.d. RVs with $E[X_i] = \mu$ and $\mathrm{Var}[X_i] = \sigma^2$.

Solution We are seeking the value of a such that

\[ E[(a\hat{\mu} - \mu)^2] \tag{6.9-11} \]

is a minimum. Clearly $\hat{\mu}$ is unbiased for $\mu$, and it seems hard to believe that there may
exist a $\hat{\Theta}$ with $a \ne 1$ that, though yielding a biased estimator, gives a lower MSE than
$\hat{\Theta} = \hat{\mu}$.
For any estimator $\hat{\Theta}$, the mean square error in estimating $\mu$ is

\[ E[(\hat{\Theta} - \mu)^2] = E[\{(\hat{\Theta} - E[\hat{\Theta}]) + (E[\hat{\Theta}] - \mu)\}^2] = \mathrm{Var}[\hat{\Theta}] + (E[\hat{\Theta}] - \mu)^2. \tag{6.9-12} \]

If $\hat{\Theta}$ is unbiased then the last term, which is the square of the bias (Definition 6.8-2), is
zero. For the case at hand, $\hat{\Theta} = a\hat{\mu}$; thus

\[ E[(\hat{\Theta} - \mu)^2] = a^2 \mathrm{Var}[\hat{\mu}] + (a\mu - \mu)^2 = \frac{a^2\sigma^2}{n} + (a - 1)^2\mu^2. \tag{6.9-13} \]

To find the MMSE estimator, we differentiate Equation 6.9-13 with respect to a and set to
zero. This yields the optimum value of $a = a_0$, that is,

\[ a_0 = \frac{\mu^2}{(\sigma^2/n) + \mu^2} = \frac{n}{(\sigma^2/\mu^2) + n}. \tag{6.9-14} \]
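
A quick numerical check of Equations 6.9-13 and 6.9-14 follows. The parameter values are arbitrary illustrations and the function names are assumptions; note that $a_0$ depends on the unknown $\mu$ and $\sigma^2$, so the sketch demonstrates the MSE comparison rather than a usable estimator.

```python
def mse_scaled_mean(a, mu, sigma2, n):
    """MSE of a * (sample mean), Equation 6.9-13: a^2 sigma^2 / n + (a - 1)^2 mu^2."""
    return a * a * sigma2 / n + (a - 1) ** 2 * mu * mu

def optimal_scale(mu, sigma2, n):
    """The minimizing constant a_0 of Equation 6.9-14."""
    return mu * mu / (sigma2 / n + mu * mu)

mu, sigma2, n = 1.0, 4.0, 4   # illustrative values only
a0 = optimal_scale(mu, sigma2, n)
print(a0)                                   # 0.5
print(mse_scaled_mean(1.0, mu, sigma2, n))  # 1.0
print(mse_scaled_mean(a0, mu, sigma2, n))   # 0.5
```

With these values the biased estimator $a_0\hat{\mu}$ halves the mean-square error of the unbiased sample mean.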





6.10 LINEAR ESTIMATION OF VECTOR PARAMETERS†


Many measurement problems in the real world are described by the following model:

\[ y(t) = \int_T h(t, \tau)\,\theta(\tau)\,d\tau + n(t), \tag{6.10-1} \]

where $y(t)$ is the observation or measurement, T is the integration set, $\theta(\tau)$ is the unknown
parameter function, $h(t, \tau)$ is a function that is characteristic of the system and links the
parameter function to the measurement but is itself independent of $\theta(\tau)$, and $n(t)$ is the
inevitable error in the measurement due to noise. For computational purposes Equation
6.10-1 must be reduced to its discrete form

\[ \mathbf{Y} = \mathbf{H}\boldsymbol{\theta} + \mathbf{N}, \tag{6.10-2} \]

where $\mathbf{Y}$ is an $n \times 1$ vector of observations with components $Y_i$, $i = 1, \ldots, n$, $\mathbf{H}$ is a
known $n \times k$ matrix ($n > k$), $\boldsymbol{\theta}$ is an unknown $k \times 1$ parameter vector, and $\mathbf{N}$ is an
$n \times 1$ random vector whose unknown components $N_i$, $i = 1, \ldots, n$ are the errors or noise
associated with the ith observation $Y_i$. We shall assume without loss of generality that
$E[\mathbf{N}] = \mathbf{0}$.‡

†This section can be omitted on a first reading.
‡The symbol $\mathbf{0}$ here stands for the zero vector, that is, the vector whose components are all zero.

Equation 6.10-2 is known as the linear model. We now ask the following question: How
do we extract a “good” estimate of $\boldsymbol{\theta}$ from the observed values of $\mathbf{Y}$ if we restrict our
estimator $\hat{\boldsymbol{\Theta}}$ to be a linear function of $\mathbf{Y}$? By a linear function we mean

\[ \hat{\boldsymbol{\Theta}} = \mathbf{B}\mathbf{Y}, \tag{6.10-3} \]

where $\mathbf{B}$, which does not depend on $\mathbf{Y}$, is to be determined. The problem posed here is of
practical significance. It is one of the most fundamental problems in parameter estimation
theory and is covered in great detail in numerous books, for example, Kendall and Stuart [6-8]
and Lewis and Odell [6-9]. It also is an immediate application of the probability theory of
random vectors and is useful for understanding various topics in subsequent chapters.

Before computing the matrix B in Equation 6.10-3, we must first furnish some results 
from matrix calculus. 


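Derivative of a scalar function of a vector. Let $q(\mathbf{x})$ be a scalar function of the vector
$\mathbf{x} = (x_1, \ldots, x_n)^T$. Then

\[ \frac{dq(\mathbf{x})}{d\mathbf{x}} \triangleq \left(\frac{\partial q}{\partial x_1}, \ldots, \frac{\partial q}{\partial x_n}\right)^T. \tag{6.10-4} \]

Thus, the derivative of $q(\mathbf{x})$ with respect to $\mathbf{x}$ is a column vector whose ith component is
the partial derivative of $q(\mathbf{x})$ with respect to $x_i$.


Derivative of quadratic forms. Let $\mathbf{A}$ be a real-symmetric $n \times n$ matrix and let $\mathbf{x}$ be
an arbitrary n-vector. Then the derivative of the quadratic form

\[ q(\mathbf{x}) \triangleq \mathbf{x}^T \mathbf{A} \mathbf{x} \]

with respect to $\mathbf{x}$ is

\[ \frac{dq(\mathbf{x})}{d\mathbf{x}} = 2\mathbf{A}\mathbf{x}. \tag{6.10-5} \]

The proof of Equation 6.10-5 is obtained by writing

\[ q(\mathbf{x}) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j = \sum_{i=1}^{n} x_i^2 a_{ii} + \sum_{i=1}^{n} \sum_{j \ne i} a_{ij} x_i x_j. \]

Hence

\[ \frac{\partial q(\mathbf{x})}{\partial x_k} = 2x_k a_{kk} + 2\sum_{i \ne k} a_{ki} x_i = 2\sum_{i=1}^{n} a_{ki} x_i \]

or

\[ \frac{dq(\mathbf{x})}{d\mathbf{x}} = 2\mathbf{A}\mathbf{x}. \tag{6.10-6} \]

Derivative of scalar products. Let $\mathbf{a}$ and $\mathbf{x}$ be two n-vectors. Then with $y \triangleq \mathbf{a}^T \mathbf{x}$, we
obtain

\[ \frac{dy}{d\mathbf{x}} = \mathbf{a}. \tag{6.10-7} \]

Let $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{A}$ be two n-vectors and an $n \times n$ matrix, respectively. Then with $q \triangleq \mathbf{y}^T \mathbf{A} \mathbf{x}$,

\[ \frac{dq}{d\mathbf{x}} = \mathbf{A}^T \mathbf{y}. \tag{6.10-8} \]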
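
The quadratic-form rule $dq/d\mathbf{x} = 2\mathbf{A}\mathbf{x}$ can be sanity-checked with central differences. The matrix and vector below are arbitrary illustrations, and the helper names are assumptions made for this sketch.

```python
def quad_form(A, x):
    """q(x) = x^T A x for a list-of-lists matrix A."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        up[k] += h
        dn = list(x)
        dn[k] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

A = [[2.0, 1.0], [1.0, 3.0]]   # symmetric, chosen for illustration
x = [1.0, -2.0]
analytic = [2 * sum(A[k][i] * x[i] for i in range(2)) for k in range(2)]
numeric = numerical_gradient(lambda v: quad_form(A, v), x)
print(analytic)  # [0.0, -10.0]
```

The numerical gradient agrees with the analytic one up to rounding error; the agreement relies on $\mathbf{A}$ being symmetric, as assumed in the derivation.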
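We return now to Equation 6.10-2:

\[ \mathbf{Y} = \mathbf{H}\boldsymbol{\theta} + \mathbf{N} \]

and assume that (recall $E[\mathbf{N}] = \mathbf{0}$)

\[ \mathbf{K} \triangleq E[\mathbf{N}\mathbf{N}^T] = \sigma^2 \mathbf{I}, \tag{6.10-9} \]

where $\mathbf{I}$ is the identity matrix. Equation 6.10-9 is equivalent to stating that the measurement
errors $N_i$, $i = 1, \ldots, n$ are uncorrelated, and their variances are the same and equal
to $\sigma^2$. This situation is sometimes called white noise.

A reasonable choice for estimating $\boldsymbol{\theta}$ is to find a $\hat{\boldsymbol{\Theta}}$ that minimizes the sum of squares S
defined by

\[ S \triangleq (\mathbf{Y} - \mathbf{H}\hat{\boldsymbol{\Theta}})^T (\mathbf{Y} - \mathbf{H}\hat{\boldsymbol{\Theta}}) \triangleq \|\mathbf{Y} - \mathbf{H}\hat{\boldsymbol{\Theta}}\|^2. \tag{6.10-10} \]

Note that by finding $\hat{\boldsymbol{\Theta}}$ that best fits the measurement $\mathbf{Y}$ in the sense of minimizing
$\|\mathbf{Y} - \mathbf{H}\hat{\boldsymbol{\Theta}}\|^2$, we are realizing what is commonly called a least-squares fit to the data.
For this reason, finding $\hat{\boldsymbol{\Theta}}$ that minimizes S in Equation 6.10-10 is called the least-squares
(LS) method. It is a form of the MMSE estimator. To find the minimum of S with respect
to $\hat{\boldsymbol{\Theta}}$, write

\[ S = \mathbf{Y}^T\mathbf{Y} + \hat{\boldsymbol{\Theta}}^T \mathbf{H}^T \mathbf{H} \hat{\boldsymbol{\Theta}} - \hat{\boldsymbol{\Theta}}^T \mathbf{H}^T \mathbf{Y} - \mathbf{Y}^T \mathbf{H} \hat{\boldsymbol{\Theta}} \]

and compute (use Equation 6.10-4 on the LHS and Equations 6.10-5 and 6.10-8 on the
RHS)

\[ \frac{\partial S}{\partial \hat{\boldsymbol{\Theta}}} = \mathbf{0} = 2[\mathbf{H}^T \mathbf{H}]\hat{\boldsymbol{\Theta}} - 2\mathbf{H}^T \mathbf{Y}, \]

whence (assuming $\mathbf{H}^T\mathbf{H}$ has an inverse)

\[ \hat{\boldsymbol{\Theta}}_{LS} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{Y}. \tag{6.10-11} \]

Comparing our result with Equation 6.10-3 we see that the $\mathbf{B}$ in Equation 6.10-3 that
furnishes the least-squares solution is given by $\mathbf{B}_0 \triangleq (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$. Equation 6.10-11 is the
LS estimator of $\boldsymbol{\theta}$ based on the measurement $\mathbf{Y}$.

The astute reader will have noticed that we never invoked the fact that $\mathbf{K} = \sigma^2\mathbf{I}$.
Indeed, in arriving at Equation 6.10-11 we essentially treated $\mathbf{Y}$ as deterministic and merely
obtained $\hat{\boldsymbol{\Theta}}_{LS}$ as the generalized inverse (see Lewis and Odell [6-9, p. 6]) of the system of
equations $\mathbf{Y} = \mathbf{H}\boldsymbol{\theta}$. As it stands, the estimator $\hat{\boldsymbol{\Theta}}_{LS}$ given in Equation 6.10-11 has no claim
to being optimum. However, when the covariance of the noise $\mathbf{N}$ is as in Equation 6.10-9,
then $\hat{\boldsymbol{\Theta}}_{LS}$ does indeed have optimal properties in an important sense. We leave it to the
reader to show that $\hat{\boldsymbol{\Theta}}_{LS}$ is unbiased and is a minimum variance estimator.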
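
The LS solution of Equation 6.10-11 amounts to solving the normal equations $(\mathbf{H}^T\mathbf{H})\boldsymbol{\theta} = \mathbf{H}^T\mathbf{y}$. The following is a small pure-Python sketch using Gauss-Jordan elimination; the function name is an assumption, and a numerical library would normally be preferred for anything beyond toy sizes.

```python
def least_squares(H, y):
    """Solve the normal equations (H^T H) theta = H^T y by Gauss-Jordan elimination."""
    n, k = len(H), len(H[0])
    # Augmented k x (k + 1) system [H^T H | H^T y].
    M = [[sum(H[i][a] * H[i][b] for i in range(n)) for b in range(k)]
         + [sum(H[i][a] * y[i] for i in range(n))] for a in range(k)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry of this column up.
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("H^T H is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[a][k] / M[a][a] for a in range(k)]

# One-parameter data: H = (3, 4, 1)^T, y = (6.2, 7.8, 2.2)^T.
theta = least_squares([[3.0], [4.0], [1.0]], [6.2, 7.8, 2.2])
print(round(theta[0], 6))  # 2.0
```

Here $\mathbf{H}^T\mathbf{H} = 26$ and $\mathbf{H}^T\mathbf{y} = 52$, so the estimate is 52/26 = 2.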


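Example 6.10-1
We are given the following data

\[ 6.2 = 3\theta + n_1, \]
\[ 7.8 = 4\theta + n_2, \]
\[ 2.2 = \theta + n_3. \]

Find the LS estimate of $\theta$.
Solution The data can be put in the form

\[ \mathbf{y} = \mathbf{H}\theta + \mathbf{n}, \]

where $\mathbf{y} = (6.2, 7.8, 2.2)^T$ is a realization of $\mathbf{Y}$, $\mathbf{H}$ is a column vector described by $(3, 4, 1)^T$,
and $\mathbf{n} = (n_1, n_2, n_3)^T$ is a realization of $\mathbf{N}$. Hence $\mathbf{H}^T\mathbf{H} = \sum_{i=1}^{3} H_i^2 = 26$ and
$\mathbf{H}^T\mathbf{y} = \sum_{i=1}^{3} H_i y_i = 52$. Thus,

\[ \hat{\theta}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y} = \frac{\sum_{i=1}^{3} H_i y_i}{\sum_{i=1}^{3} H_i^2} = \frac{52}{26} = 2. \]


Example 6.10-2
([6-8], p. 77.) Let $\boldsymbol{\theta} = (\theta_1, \theta_2)^T$ be a two-component parameter vector to be estimated, and
let $\mathbf{H}$ be an $n \times 2$ matrix of coefficients partitioned into column vectors as $\mathbf{H} = (\mathbf{H}_1\ \mathbf{H}_2)$,
where $\mathbf{H}_i$, $i = 1, 2$ is an n-vector. Then with the n-vector $\mathbf{Y}$ representing the observation
data, the linear model assumes the form

\[ \mathbf{Y} = (\mathbf{H}_1\ \mathbf{H}_2)\boldsymbol{\theta} + \mathbf{N} \]

and the LS estimator of $\boldsymbol{\theta}$ is

\[ \hat{\boldsymbol{\Theta}}_{LS} = \begin{bmatrix} \mathbf{H}_1^T\mathbf{H}_1 & \mathbf{H}_1^T\mathbf{H}_2 \\ \mathbf{H}_2^T\mathbf{H}_1 & \mathbf{H}_2^T\mathbf{H}_2 \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{H}_1^T\mathbf{Y} \\ \mathbf{H}_2^T\mathbf{Y} \end{bmatrix}. \]


SUMMARY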


In the branch of statistics known as parameter estimation we apply the tools of probability 
to observational data to estimate parameters associated with probability functions. We 
began the chapter by stressing the importance of independent, identically distributed (i.i.d.) 
observations on a random variable of interest. We then described how these observations 
can be organized to estimate parameters such as the mean and variance, with emphasis 
on the Normal distribution. The problem of making “hard” (i.e., categorical) statements 
about parameters when the number of observations is finite was resolved using the notion 
of confidence intervals. Thus, we were able to say that based on the observations, the 
true mean, or variance, or both had to lie in a computed interval with a near 100 percent 
confidence. We studied the properties of the standard mean-estimating function and found 
that it was unbiased and consistent. 

We found that the t-distribution, describing the probabilistic behavior of the T random 
variable, was of central importance in constructing a confidence interval for the mean of a 
Normal random variable when the variance is unknown. 

In estimating the variance of a Normal random variable, we found that the Chi-square
distribution was useful in constructing a near 100 percent confidence interval for the variance.
We briefly discussed a method of estimating the standard deviation of a Normal
random variable from ordered observations.

We demonstrated that confidence intervals could also be developed for parameters of
distributions other than the Normal. This was demonstrated with examples from the
exponential and Bernoulli distributions.

A method of estimating parameters based on the idea of which parameter was most 
likely to have produced the observational data was discussed. This method, called maximum 
likelihood estimation (MLE), is very powerful but does not always yield unbiased or minimum 
mean-square error estimators. 

Toward the end of the chapter we introduced nonparametric methods for parameter 
estimation. These methods, also called distribution-free estimation, do not assume a specific 
distribution for generating the observational data. In this sense they are said to be robust. 
We found that a number of important results in the nonparametric case could be obtained 
using ordered data and the binomial distribution. 

Finally, we extended the earlier discussions on parameter estimation to the vector case. 
In particular, we showed how the elements of vector means and covariance matrices could 
be estimated from observational data. A brief discussion of estimating vector parameters 
from linear operation on measurement data completed the chapter. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 


6.1 If 36 of 100 persons interviewed are familiar with tax incentives for installing certain
energy saving devices, construct a 95% confidence interval for the corresponding true
proportion. What is the margin of error?

6.2 We have three i.i.d. observations on X:N(0,1). Call these $X_i$, $i = 1, 2, 3$. Compute
$f_{X_1X_2X_3}(x_1, x_2, x_3)$ and compare with $f_{X_1+X_2+X_3}(y)$.

6.3 In a village in a developing country, 361 villagers are exposed to the Ebola
hemorrhagic-fever virus. Of the 361 exposed villagers, 189 die of the virus infection.
Compute a 95 percent confidence interval on the probability of dying from the Ebola
virus once you have been exposed to it. What is the margin of error?

6.4 Show that the roots of the polynomial $(p - \hat{p})^2 - (9/n)p(1 - p) = 0$ that appeared
in Example 6.1-6 are indeed as given in Equation 6.1-1.

6.5 Referring again to Example 6.1-6, compute $|p_1 - p_2|$ as $\hat{p}$ varies from zero to one.
Do this for different values of n, for example, n = 10, 20, 30, 50.

6.6 Describe how you would test for the fairness of a coin with a 95 percent confidence
interval on the probability that the coin will come up heads.

6.7 Consider the variance estimating functions in Equations 6.3-3 and 6.3-4. Show that
for values of n > 20, the difference between them becomes extremely small. Reproduce
the curve shown below.

[Figure: plot titled "Difference between variance estimating functions versus sample size n"; vertical axis: difference; horizontal axis: sample size.]

6.8 Compute $P[|\hat{\mu}_X(n) - \mu_X| < 0.1]$ as a function of n when X : N(1,1).

6.9 Plot the width of a 95 percent confidence interval on the mean of a Normal random
variable whose variance is unity versus the number of samples n.

6.10 Show that the MGF of the gamma pdf
\[ f(x; \alpha, \beta) = (\alpha!\,\beta^{\alpha+1})^{-1} x^{\alpha} \exp(-x/\beta), \quad x > 0;\ \alpha > -1,\ \beta > 0 \]
is $M(t) = (1 - \beta t)^{-(\alpha+1)}$.

6.11 We make n i.i.d. observations $X_i$, $i = 1, \ldots, n$ on $X : N(\mu, \sigma^2)$ and construct
$Y_i \triangleq (X_i - \mu)/\sigma$. Use the result of Problem 6.10 to show that the pdf of $W_n \triangleq \sum_{i=1}^{n} Y_i^2$ is
$\chi^2$ with n degrees of freedom, that is,
\[ f_W(w; n) = \left([(n/2) - 1]!\,2^{n/2}\right)^{-1} w^{(n/2)-1} \exp(-w/2), \quad w > 0. \]

6.12 We make n i.i.d. observations $X_i$ on $X : N(\mu, \sigma^2)$ and construct $\hat{\mu} \triangleq n^{-1}\sum_{i=1}^{n} X_i$ and
$\hat{\sigma}^2 \triangleq (n-1)^{-1}\sum_{i=1}^{n}(X_i - \hat{\mu})^2$. Show that $\hat{\mu}$ and $\hat{\sigma}^2$ are independent. (Hint: It helps
to use moment generating functions; if all else fails consult Appendix F.)

6.13 Let $X : N(\mu, \sigma^2)$ and $W_n : \chi_n^2$ be two independent RVs, and let $Y \triangleq (X - \mu)/\sigma$.
a) Show that the joint density of Y and $W_n$ is given by
\[ f_{YW_n}(y, w) = \frac{1}{\sqrt{2\pi}} \exp(-0.5y^2) \times \frac{w^{(n-2)/2}\exp(-0.5w)}{2^{n/2}\,\Gamma(n/2)}, \quad -\infty < y < \infty,\ w > 0; \]
b) Let $T \triangleq Y/\sqrt{W_n/n}$; show that
\[ f_T(t; n) = \frac{\Gamma[(n+1)/2]}{\sqrt{\pi n}\,\Gamma(n/2)} \times \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}, \quad -\infty < t < \infty. \]
This is the “Student’s” t-pdf. (Hint: use a proper two variable-to-two variable transformation.)

6.14 Let $(X_1, X_2, \ldots, X_n)$ be a random sample of a uniform random variable X over
(0, a), where a is unknown. Show that $\hat{A} = \max(X_1, X_2, \ldots, X_n)$ is a consistent
estimator of the parameter a.

6.15 Use Matlab™, Excel™, or some other scientific computing program to create a 95
percent confidence interval for the mean of a Normal random variable X : N(0, 1).
Use 50 observations per single interval computation and repeat the experiment 50
times. For each experiment record the length of the interval and whether it includes
the mean, which in this case is zero. Repeat for 100 observations per interval computation.

6.16 Show that the sample variance in Equation 6.2-3 is unbiased.

6.17 Show that the sample variance in Equation 6.2-3 is consistent.

6.18 Suppose that we want to estimate the true proportion of defectives in a very large
shipment of adobe bricks, and that we want to be at least 95% confident that the
error is at most 0.04. How large a sample will we need if we know that the true
proportion does not exceed 0.12?

6.19 Consider a box that contains a mix of red and blue balls whose exact composition of
red and blue balls is not known. If we draw n balls from the box with replacement
and obtain k red balls, what is the maximum likelihood estimate of p, the probability
of drawing a red ball?

Find a 95 percent confidence interval for the variance $\sigma_X^2$ of the distribution.

6.20 Show that the number a in Equation 6.6-6 is $a = z_{(1+\delta)/2}$, $-a = z_{(1-\delta)/2}$; that is, a
is the $(1 + \delta)/2$ percentile of the Z:N(0, 1).

6.21 An optical firm purchases glass to be ground into lenses. It knows from past experience
that the variance of the refractive index of this kind of glass is $1.26 \times 10^{-4}$.
Suppose that the refractive indices of 20 pieces of glass (randomly selected from
a large shipment purchased by the optical firm) have a variance of $1.20 \times 10^{-4}$.
Construct a 95% confidence interval for $\sigma$, the standard deviation of the population
sampled.

6.22 Let $X_1, X_2, X_3$ be three observations on $X : N(\mu_X, \sigma_X^2)$. Let $V_i \triangleq (X_i - \hat{\mu}_X(3))/\sigma_X$ for
i = 1, 2, 3. Show that $\sum_{i=1}^{3} V_i^2$ is Chi-square with two degrees of freedom.

6.23 A random sample of size n = 100 is taken from a population with $\sigma = 5.1$. Given that
the sample mean is $\bar{x} = 21.6$, construct a 95% confidence interval for the population
mean $\mu$. Since the sample size n = 100 is large, normal approximation is assumed.
6.24 A method has been developed to estimate the size of the fish population by performing
a capture/recapture experiment. Let N be the actual population to be estimated. r
animals are first captured and tagged. The r animals are then released and allowed
to mix into the general population. Later, n animals are captured (or recaptured)
and the number of tagged animals k is counted. Determine the maximum likelihood
estimate of N.

6.25 Show that the covariance estimating function of Equation 6.3-1 is unbiased and
consistent.

6.26 Find the mean and variance of the random variable X driven by the geometric
probability mass function $P_X(n) = (1 - a)a^n u(n)$. Compute a 95 percent confidence
interval on the mean of X.

6.27 In Example 6.5-3 the claim is made that $P\left[-a < \frac{\hat{p} - p}{\sqrt{pq/n}} < a\right] = \delta$ is identical
with $P[(\hat{p} - p)^2 < a^2pq/n] = \delta$. Justify this claim.

6.28 Compute the maximum likelihood estimate for the parameter $\lambda$ in the Poisson pmf.

6.29 Let X be uniformly distributed in (−1, 1) and let $Y = X^2$. Find the best linear
estimator for Y in terms of X. Compare its performance to the best estimator.

6.30 Compute the MLE for the parameter $p \triangleq P[\text{success}]$ in the binomial PMF.

6.31 Compute the MLE for the parameters a, b (a < b) in $f_X(x) = (b - a)^{-1}(u(x - a) - u(x - b))$.

6.32 [6-2] Consider the linear model $\mathbf{Y} = \mathbf{I}\mathbf{a} + b\mathbf{x} + \mathbf{V}$, where

$\mathbf{Y} \triangleq (Y_1, \ldots, Y_n)^T$
$\mathbf{V} \triangleq (V_1, \ldots, V_n)^T$
$\mathbf{I} \triangleq n \times n$ identity matrix
$\mathbf{x} \triangleq (x_1, \ldots, x_n)^T$
$\mathbf{a} \triangleq (a, \ldots, a)^T$

and a, b are constants to be determined. Assume that the $\{V_i, i = 1, \ldots, n\}$ are n
i.i.d. Normal random variables as $N(0, \sigma^2)$, the $x_i$, $i = 1, \ldots, n$ are constant for each
$i = 1, \ldots, n$, but may vary as i varies. They are called control variables.

(i) Show that $\{Y_i, i = 1, \ldots, n\}$ are $N(a + bx_i, \sigma^2)$;
(ii) Write the likelihood function and argue that it is maximized when
$\sum_{i=1}^{n} (Y_i - (a + bx_i))^2$ is minimized;
(iii) Show that the MLE of a is $\hat{a}_{ML} = \hat{\mu}_Y - \hat{b}_{ML}\bar{x}$ and the MLE of b is
\[ \hat{b}_{ML} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) Y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \]
where $\hat{\mu}_Y \triangleq (1/n)\sum_{i=1}^{n} Y_i$ and $\bar{x} \triangleq (1/n)\sum_{i=1}^{n} x_i$.

6.33 The height of students in a college follows a Normal distribution with mean
173.3 cm and standard deviation of 6.4 cm. Determine the 95th percentile of the
height random variable.

6.34 Compute the median for the geometrically distributed RV.

6.35 Compute the mean and median for the Chi-square random variable.

6.36 Show that an estimate for the 30th percentile, $x_{0.3}$, is given by the interpolation


formula Y4 + Ya Ya) 05-4 1) a Zo.3, where the {Y;, i = 1,...,10} are the ordered 
random variables formed from the set of unordered i.i.d. random variables $\{X_i, i = 1, \ldots, 10\}$.

6.37 How large a sample do we need to cover the 50th percentile with probability 0.99?
Hint: Use the formula
\[ P[Y_1 < x_{0.5} < Y_n] = \sum_{i=1}^{n-1} \binom{n}{i} (1/2)^n = 0.99, \]
where $Y_1 \triangleq \min(X_1, X_2, \ldots, X_n)$, $Y_n \triangleq \max(X_1, X_2, \ldots, X_n)$.

*6.38 Show that the joint pdf of the ordered random variables $Y_1$, $Y_n$, where
$Y_1 \triangleq \min(X_1, X_2, \ldots, X_n)$, $Y_n \triangleq \max(X_1, X_2, \ldots, X_n)$, is given by
\[ f_{Y_1Y_n}(y_1, y_n) = n(n-1)\left(F_X(y_n) - F_X(y_1)\right)^{n-2} f_X(y_1) f_X(y_n), \quad -\infty < y_1 < y_n < \infty. \]
Hint: Consider the joint pdf of all the $Y_1, Y_2, \ldots, Y_n$ and integrate out all but the
first and last.

*6.39 Let $\{Y_i, i = 1, \ldots, n\}$ be a set of ordered random variables. Define the range R of
the set as $R \triangleq Y_n - Y_1$. Now consider six observations $\{X_i, i = 1, \ldots, 6\}$ on X
from the pdf $f_X(x) = u(x) - u(x - 1)$, where $u(x)$ is the unit-step function.
Show that $f_R(r) = 30r^4(1 - r)$, $0 < r < 1$. Hint: Use the result
$f_{Y_1Y_n}(y_1, y_n) = n(n-1)\left(F_X(y_n) - F_X(y_1)\right)^{n-2} f_X(y_1) f_X(y_n)$, $-\infty < y_1 < y_n < \infty$, and define two
random variables $R \triangleq Y_n - Y_1$, $S \triangleq Y_1$ and find $f_{RS}(r, s)$. Then integrate out with
respect to S.
Statistics: Part 2
Hypothesis Testing


Hypothesis testing is an important topic in the broader area of statistical decision theory.
Statistical decision theory also includes other activities such as prediction, regression, game
theory, statistical modeling, and many elements of signal processing. However, the ideas
underlying hypothesis testing often serve as the basis for these other, typically more
advanced, areas.†

Hypotheses take the following form: We make a hypothesis that a parameter has a
certain value, or lies in a certain range, or that a certain event has taken place. The so-called
alternative hypothesis‡ is that the parameter has a different value, or lies in a different range,
or that an event has not taken place. Then, based on real data, we either accept (reject) the
hypothesis or accept (reject) the alternative. Parameter estimation and hypothesis testing
are clearly related. For example, the decision to accept the hypothesis that the mean of one
population is equal to the known mean of another population is essentially equivalent to
estimating the mean of the unknown population and deciding that it is close enough to the
given mean to deem them equal.

In the real world we often are forced to make decisions when we don’t have all the facts,
or when our knowledge comes from observations that are inherently probabilistic. We all
(probably) know heavy smokers who live well into their eighties and beyond. Likewise, we
know of nonsmokers who die of lung cancer in their fifties. Does this mean that smoking
is unrelated to lung cancer? In days of old, the chiefs of tobacco companies said yes while
cancer epidemiologists said no. In view of all the evidence accumulated since then, no reasonable
person would now argue that smoking does not increase the likelihood of dying from
lung cancer. Nevertheless, unlike what happens when a person falls off a 20-story building
onto concrete, death by lung cancer or other smoking-related disease does not always follow
heavy smoking. The relationship between smoking and lung cancer remains essentially
probabilistic. In the following sections we discuss strategies for decision making in a probabilistic
environment.

†There are several textbook references for this material, for example [7-1] to [7-4].
‡The alternative hypothesis is often called, simply, the alternative. Thus, one encounters “we test the
hypothesis ... versus the alternative ....”


7.1 BAYESIAN DECISION THEORY 


In the absence of divine guidance, the Bayesian approach to making decisions in a random 
(stochastic) environment is, arguably, the most rational procedure devised by humans. 
Unfortunately, to use Bayes in its original form requires information we may not have 
with any accuracy or may be impossible to get. We illustrate the application of Bayesian 
theory and its concomitant weakness in the following example. 


Example 7.1-1
(deciding whether to operate or not) Assume that you are a surgeon and that your patient’s
x-ray shows a nodule in his left lung. The patient is 40 years old, has no history of smoking,
and is otherwise in good health. Let us simplify the problem and assume that there are only
two possible states: (1) The nodule is an early onset cancer that without treatment will
spread and kill the patient and (2) the nodule is benign and doesn’t pose a health risk. We
shall abbreviate the former by the symbol $\zeta_1$ and the latter by $\zeta_2$. The reader will recognize
that the outcome space (read sample space) $\Omega$ has only the two points, that is, $\Omega = \{\zeta_1, \zeta_2\}$,
but in more complex situations could in fact have many more. The surgeon’s job is to
make that decision (and take subsequent action) that is best for the patient. The trouble
is that without an operation the surgeon doesn’t know the state of nature, that is, whether
$\zeta_1$ or $\zeta_2$ is the case. There are two terminal actions: operate ($a_1$) or don’t operate ($a_2$).

It is not always clear as to what “best” means. However, it seems quite reasonable, other 
things being equal, that “best” in this case is that decision/action that will minimize the 
number of years that the patient will lose from a normal lifetime. There are four situations 
to consider: 


(1) The surgeon decides not to operate and the nodule is benign;
(2) The surgeon decides not to operate and the nodule is a cancer;
(3) The surgeon operates and the nodule is benign;
(4) The surgeon operates and the nodule is a cancer.


Prior data exist that lung nodules discovered in nonsmoking, early middle-age males are 
benign 70 percent of the time. Thus, the probability that a nodule is cancerous for this 
group is only 0.3. The surgeon is also aware of the data in Table 7.1-1. 

The terms $\{l(a_i, \zeta_j),\ i = 1, 2;\ j = 1, 2\}$ are called loss functions and $l(a_i, \zeta_j)$ is the
loss associated with taking action $a_i$ when the state of nature is $\zeta_j$. The reader might
ask why $l(a_1, \zeta_1) = l(a_1, \zeta_2) = 5$ and not zero. Surgeons know that operations are risky
procedures and that even healthy patients can suffer from post-operative infections such
as from MRSA† and gram-negative bacteria‡. Unless absolutely necessary, most surgeons
will avoid major invasive surgery in preference to non-invasive procedures. Thus, due to
infections and other complications any surgery carries a risk and, counting the people who
die from surgical complications, we assign an average loss of five years.

Table 7.1-1

If the decision is            And the state of nature is    Then the number of years subtracted from a normal life span is l(a, ζ)
Don’t operate (action a_2)    Benign lesion (ζ_2)           l(a_2, ζ_2) = 0
Don’t operate (action a_2)    Cancer (ζ_1)                  l(a_2, ζ_1) = 35
Operate (action a_1)          Benign lesion (ζ_2)           l(a_1, ζ_2) = 5
Operate (action a_1)          Cancer (ζ_1)                  l(a_1, ζ_1) = 5
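
Before any datum is observed, the losses of Table 7.1-1 can be averaged over the prior probabilities quoted in the example (0.3 for cancer, 0.7 for benign). The sketch below computes that prior expected loss for each action; the dictionary keys and function name are illustrative choices, not notation from the text.

```python
# Losses (years of life lost) from Table 7.1-1, keyed by (action, state).
LOSS = {
    ("operate", "cancer"): 5, ("operate", "benign"): 5,
    ("dont_operate", "cancer"): 35, ("dont_operate", "benign"): 0,
}
PRIOR = {"cancer": 0.3, "benign": 0.7}  # prior probabilities quoted in the example

def expected_loss(action):
    """Average loss of an action under the prior, before any datum is observed."""
    return sum(PRIOR[s] * LOSS[(action, s)] for s in PRIOR)

print(round(expected_loss("operate"), 2))       # 5.0
print(round(expected_loss("dont_operate"), 2))  # 10.5
```

On the prior alone, operating carries the smaller expected loss (5 years against 10.5); an observation on the patient can change that comparison, which is where a decision function comes in.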

Next, we introduce the idea of a decision function d. The decision function d is a
function of observable data so we write $d(X_1, X_2, \ldots, X_n)$, where the $\{X_i, i = 1, \ldots, n\}$ are
n i.i.d. observations on a random variable (RV) X. The decision function $d(X_1, X_2, \ldots, X_n)$
helps to guide the surgeon with respect to what action, that is, $a_1$ or $a_2$, to take. In our
example we limit ourselves to a single observation that we denote X, specifically the ratio
of the square of the length of the boundary of the nodule to the enclosed area. This is a
measure of the irregularity of the edges of the nodule: The more irregular the edges, the
more likely that the nodule is a cancerous lesion (Figure 7.1-1). Thus, we expect that most
of the time the RV X for the cancerous lesion ($\zeta_1$) will yield larger realizations than those
yielded by X for the benign case ($\zeta_2$). A realization of X in this case is the datum.







Figure 7.1-1 (a) A benign lesion tends to have regular edges; (b) a cancerous lesion tends to have 
irregular edges. 


†Methicillin-resistant Staphylococcus aureus.
‡These bugs prevail in hospitals and cause infections that are difficult to treat.

Figure 7.1-2 There is a value of $c$ (to be determined) that will minimize the expected risk. A datum
point in the region $\Gamma_2 \triangleq (-\infty, c)$ is more likely to be associated with a benign condition and will lead
to action $a_2$ (don't operate), while a datum point in $\Gamma_1 \triangleq [c, \infty)$ is more likely to be associated with a
cancer and will lead to action $a_1$ (operate).


Let $f(x;\zeta_1)$ and $f(x;\zeta_2)$ denote the pdf's of $X$ under conditions $\zeta_1$ and $\zeta_2$, respectively
(see Figure 7.1-2). In this example we assume, for simplicity and ease of visualization, that
these pdf's are unimodal and are continuous. Further, as shown in Figure 7.1-2, we assume
that there exists a constant $c$ such that if the datum falls to the right of $c$ it will be taken as
evidence that the opacity is a cancer. Likewise, if the datum falls to the left of $c$ it will be
taken as evidence that cancer is not present. If the evidence suggests a cancer then action
$a_1$ follows; else, action $a_2$ follows. Since this is a probabilistic environment, errors will be
made. Thus,

$$P[a_1|\zeta_2] = \int_{c}^{\infty} f(x;\zeta_2)\,dx \tag{7.1-1}$$

is the error probability that the evidence suggests there is a cancer that requires an operation
when in fact there is no cancer. Likewise

$$P[a_2|\zeta_1] = \int_{-\infty}^{c} f(x;\zeta_1)\,dx \tag{7.1-2}$$

is the error probability that the evidence suggests there is no cancer and therefore the action
is not to operate while in fact there is a cancer.

The conditional expectation of the loss when the state of nature is $\zeta$ and the decision
rule is $d$ is called the risk $R(d;\zeta)$. Thus,

$$R(d;\zeta_1) = l(a_1,\zeta_1)P[a_1|\zeta_1] + l(a_2,\zeta_1)P[a_2|\zeta_1]$$
and
$$R(d;\zeta_2) = l(a_1,\zeta_2)P[a_1|\zeta_2] + l(a_2,\zeta_2)P[a_2|\zeta_2]. \tag{7.1-3}$$

Finally, the expected risk, labeled $B(d)$ and defined as†

$$B(d) \triangleq R(d;\zeta_1)P[\zeta = \zeta_1] + R(d;\zeta_2)P[\zeta = \zeta_2], \tag{7.1-4}$$

is the quantity to be minimized. A decision function $d^*$ that minimizes $B(d)$ is a Bayes
strategy. Thus,

$$B(d^*) = \min_{d}\left\{R(d;\zeta_1)P[\zeta = \zeta_1] + R(d;\zeta_2)P[\zeta = \zeta_2]\right\}. \tag{7.1-5}$$

†The symbol B is used in honor of the mathematician/philosopher Thomas Bayes (1702-1761).

The probabilities
$$P_1 \triangleq P[\zeta = \zeta_1], \qquad P_2 \triangleq P[\zeta = \zeta_2]$$
are called the a priori probabilities of the state of nature. In terms of the symbols introduced
above, we can write $B(d)$ as

$$B(d) = P_1\, l(a_2,\zeta_1) + P_2\, l(a_2,\zeta_2) + \int_{c}^{\infty}\left\{P_2 f(x;\zeta_2)\left[l(a_1,\zeta_2) - l(a_2,\zeta_2)\right] - P_1 f(x;\zeta_1)\left[l(a_2,\zeta_1) - l(a_1,\zeta_1)\right]\right\}dx, \tag{7.1-6}$$


where we choose $c$ to minimize $B(d)$. Wherever the integrand in the expression for $B(d)$ is
positive, it adds to $B(d)$; wherever the integrand is negative, it reduces $B(d)$. Indeed if we
choose $c$, say $c = c^*$, so that $(c^*,\infty)$ leaves out all the points where the integrand is positive but
includes the points where the integrand is negative, then we have minimized $B(d)$. Outcomes
(read events) that make the integrand negative are described by

$$\frac{f(X;\zeta_1)}{f(X;\zeta_2)} > \frac{\left[l(a_1,\zeta_2) - l(a_2,\zeta_2)\right]P_2}{\left[l(a_2,\zeta_1) - l(a_1,\zeta_1)\right]P_1} \triangleq k_B, \tag{7.1-7}$$

which is the Bayes decision rule. It says that for all outcomes† in $(c^*,\infty)$ take action $a_1$
(operate). Likewise for all outcomes in $(-\infty, c^*)$, that is,

$$\frac{f(X;\zeta_1)}{f(X;\zeta_2)} < k_B,$$

take action $a_2$ (don't operate). The constant $c^*$ is the point that satisfies

$$\frac{f(c^*;\zeta_1)}{f(c^*;\zeta_2)} = k_B.$$

The prior probabilities in this example would be computed from aggregate information on
thousands of patients who sought help for similar symptoms. The nodule observed in a
40-year-old, nonsmoking male is more likely to be benign than cancerous; for example, it might be
a harmless opacity, some residual scar tissue, or even the intersection of blood vessels giving
the appearance of a lesion. For simplicity we shall assume that we know these probabilities
as $P_1 = 0.3$ (cancer) and $P_2 = 0.7$ (benign). Then specializing Equation 7.1-6 for this case yields

$$B(d) = 10.5 + \int_{c}^{\infty}\left(3.5 f(x;\zeta_2) - 9 f(x;\zeta_1)\right)dx,$$

which implicitly yields the constant $c^*$ from $3.5 f(c^*;\zeta_2) - 9 f(c^*;\zeta_1) = 0$. Then the Bayes
decision rule is

$f(X;\zeta_1)/f(X;\zeta_2) > 0.39$ → operate
$f(X;\zeta_1)/f(X;\zeta_2) < 0.39$ → don't operate.

†Recall that under the mapping of the (real) RV X, events are intervals on the real line.
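As a rough numerical illustration of this rule, the following Python sketch computes the Bayes threshold from the losses in Table 7.1-1 and the priors, and then locates $c^*$ by bisection. The Gaussian shapes used for $f(x;\zeta_1)$ and $f(x;\zeta_2)$ (means 20 and 14, common standard deviation 3) are purely hypothetical choices made for illustration; the text assumes only that the pdf's are unimodal and continuous.

```python
from statistics import NormalDist

# Losses from Table 7.1-1, keyed by (action, state); state 1 = cancer, 2 = benign.
loss = {(1, 1): 5.0, (1, 2): 5.0, (2, 1): 35.0, (2, 2): 0.0}
P1, P2 = 0.3, 0.7  # prior probabilities of cancer and of a benign nodule

# Bayes threshold k_B of Equation 7.1-7 (equals 3.5/9, about 0.39).
k_B = (loss[(1, 2)] - loss[(2, 2)]) * P2 / ((loss[(2, 1)] - loss[(1, 1)]) * P1)

# Hypothetical pdf's of the irregularity measure X (an assumption, not from the text).
f1 = NormalDist(mu=20.0, sigma=3.0)  # cancer
f2 = NormalDist(mu=14.0, sigma=3.0)  # benign


def likelihood_ratio(x):
    return f1.pdf(x) / f2.pdf(x)


def find_c_star(lo=0.0, hi=40.0, tol=1e-9):
    """Bisection for likelihood_ratio(c) = k_B; the ratio increases with x here."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if likelihood_ratio(mid) < k_B:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expected_risk(c):
    """B(d) of Equation 7.1-6 for the rule 'operate if X >= c'."""
    p_operate_cancer = 1.0 - f1.cdf(c)  # P[a1 | zeta1]
    p_operate_benign = 1.0 - f2.cdf(c)  # P[a1 | zeta2]
    return (P1 * loss[(2, 1)] + P2 * loss[(2, 2)]
            + P2 * (loss[(1, 2)] - loss[(2, 2)]) * p_operate_benign
            - P1 * (loss[(2, 1)] - loss[(1, 1)]) * p_operate_cancer)


c_star = find_c_star()
print(f"k_B = {k_B:.3f}, c* = {c_star:.3f}, B(d*) = {expected_risk(c_star):.3f}")
```

Under these assumed pdf's, the expected risk at $c^*$ can be compared with the two trivial strategies: never operating ($c \to \infty$) costs 10.5 years and always operating ($c \to -\infty$) costs 5 years.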


In Example 7.1-1 only a single RV was used in making the decision. In many problems,
however, a decision will be based on observing many i.i.d. RVs. In that case the Bayes
decision rule takes the form

$$\frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} > \frac{\left[l(a_1,\zeta_2) - l(a_2,\zeta_2)\right]P_2}{\left[l(a_2,\zeta_1) - l(a_1,\zeta_1)\right]P_1} = k_B, \quad \text{accept } \zeta_1 \text{ as state of nature}$$

$$\frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} < \frac{\left[l(a_1,\zeta_2) - l(a_2,\zeta_2)\right]P_2}{\left[l(a_2,\zeta_1) - l(a_1,\zeta_1)\right]P_1} = k_B, \quad \text{reject } \zeta_1 \text{ as state of nature.} \tag{7.1-8}$$

The reader will recognize that the numerator and denominator in Equation 7.1-8 are the
likelihood functions $L(\zeta_j) = \prod_{i=1}^{n} f_X(X_i;\zeta_j)$, $j = 1,2$, discussed in Chapter 6. Therefore
Equation 7.1-8, being a ratio of two likelihood functions (the likelihood ratio) that is being
compared to a constant, is quite appropriately called a likelihood ratio test (LRT). The
constant $k_B$ in Equation 7.1-8 is called the Bayes threshold.

Every Bayes strategy leads to an LRT but not every LRT is the result of a Bayes 
strategy. The Bayes strategy seeks to minimize the average risk but other LR-type tests 
may seek to abide by different criteria, for example, maximizing the LR subject to a given
probability of error. One problem with implementing the Bayes strategy is that the a priori
probabilities $P_1$ and $P_2$ are often not known. Another problem is that it may be difficult to
assign a reasonable “loss” to a particular action. For example, say that you are preparing a 
large omelet and need to break a dozen eggs. You are thinking of using a Bayes strategy to 
minimize the loss, that is, the amount of work you have to do. Your choices are to use one 
bowl or two bowls and the random element here is whether an egg is good or bad. Suppose
that, on average, for every 100 good eggs there is one bad egg. If you use only one bowl 
and a bad egg is added to the others before you realize that it is bad, then you have ruined 
the whole mixture. If you use two bowls, a small one in which you inspect the contents of 
a newly broken egg before adding it to the other eggs, and a large one containing all the 
(good) broken eggs, then you avoid ruining the mixture if the egg is bad. Now, however, you 
have two bowls to wash instead of one when you are finished. How would you reasonably 
define the loss in this case? While this example is perhaps not terribly serious, it illustrates 
one of the problems associated with trying to apply the Bayes strategy. Another problem is 
that it may be difficult to estimate prior probabilities for rare events. For example, suppose 
a country wants to use its antimissile resources against an attack by a hostile neighbor. 
If the defense strategy is designed according to a Bayes criterion, knowledge of the prior 
probability of an enemy attack is needed. How would one estimate this in a reasonable way?

7.2 LIKELIHOOD RATIO TEST


Because prior probabilities are often not available and loss functions may not be easily 
defined, we drop the constraint on minimizing the expected risk and modify the Bayes 
decision rule as 


$$\frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} > k, \quad \text{accept } \zeta_1 \text{ as state of nature}$$
$$\frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} < k, \quad \text{reject } \zeta_1 \text{ as state of nature,} \tag{7.2-1}$$

where the threshold value $k$ is determined from criteria possibly other than that of Bayes.
Common criteria are related to the probabilities of rejecting a claim when the claim is
true and/or accepting the counterclaim when the counterclaim is true. This kind of test is
known as a likelihood ratio test that tests a simple hypothesis (the claim) against a simple
alternative (the counterclaim). To save on notation we define the LRT random variable as

$$\Lambda \triangleq \frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} = L(\zeta_1)/L(\zeta_2) \tag{7.2-2}$$


and illustrate its application in an example. 


Example 7.2-1
(testing a claim for a food) Consider a health-food manufacturer who claims to have
developed a snack bar for kids that will reduce childhood obesity.† The snack bar, while tasty,
supposedly acts as an appetite suppressant and thereby helps reduce the desire for fattening
in-between-meals snacks such as potato chips, hamburgers, sugar-sweetened soda, chocolate
bars, etc. To test the validity of this claim, we take $n$ children (a subset of a large,
well-defined group) and give them the weight-controlling snack bar. After one month, the average
weight for this group is 98 lbs with a standard deviation of 5 lbs. The other children in the
group, that is, the ones not taking the weight-controlling snack bar, average 102 lbs with
a standard deviation of 5 lbs. We make the hypothesis‡ that the weight-controlling snack
bar has no effect in controlling obesity. This is called the null hypothesis and is denoted
by $H_1$. The alternative, denoted by $H_2$, is that the weight-controlling snack is helpful in
controlling obesity.§ It does not matter which hypothesis we designate as $H_1$ but once the
choice is made, we are required to be consistent throughout the problem. In the absence of
a well-defined loss function, we focus instead on the probabilities of error. We define

$$\alpha \triangleq P[\text{based on our test we decide that } H_2 \text{ is true} \mid H_1 \text{ is true}]$$
$$\beta \triangleq P[\text{based on our test we decide that } H_1 \text{ is true} \mid H_2 \text{ is true}].$$

†Obesity among children is a severe problem in the United States. Extrapolated from the present rate
of caloric consumption, it is predicted that in 2020 three out of four Americans will be overweight or obese
(Consumer Reports, December 2010, p. 11).

‡The meaning of this word is: an assumption provisionally accepted but currently not rigorously
supported by evidence. A hypothesis is rejected if subsequent information doesn't support it.

§In many books the null hypothesis is denoted by $H_0$ and the alternative is denoted by $H_A$. We prefer
the numerical subscript notation.


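With $X_i$, $i = 1,\ldots,n$, being i.i.d. RVs denoting the weights of the $n$ children, we assume
that the weights of both groups are Normally distributed near their means,† that is,

$$f_X(x_i; H_1) = \frac{1}{\sqrt{50\pi}}\exp\left[-\frac{1}{2}\left(\frac{x_i - 102}{5}\right)^2\right],$$

$$f_X(x_i; H_2) = \frac{1}{\sqrt{50\pi}}\exp\left[-\frac{1}{2}\left(\frac{x_i - 98}{5}\right)^2\right].$$

We note that $f_X(x_i; H_1)$ ($f_X(x_i; H_2)$) is the pdf of $X_i$, $i = 1,\ldots,n$, under the condition
that $H_1$ ($H_2$) applies.
Then from Equation 7.2-2,

$$\Lambda = \exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left[\left(\frac{X_i - 102}{5}\right)^2 - \left(\frac{X_i - 98}{5}\right)^2\right]\right).$$

Further simplification yields

$$\Lambda = K_n \exp\left(\frac{4n}{25}\hat{\mu}_X(n)\right),$$

where $\hat{\mu}_X(n) \triangleq (1/n)\sum_{i=1}^{n} X_i$ and $K_n$ is a constant independent of the $\{X_i, i = 1,\ldots,n\}$ but
dependent on the sample size $n$. The decision function then becomes

if $K_n \exp\left(\frac{4n}{25}\hat{\mu}_X(n)\right) > k_n$, accept $H_1$ (reject $H_2$)
if $K_n \exp\left(\frac{4n}{25}\hat{\mu}_X(n)\right) < k_n$, accept $H_2$ (reject $H_1$).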
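
The simplification step can be spelled out as follows. This is a sketch of the algebra that uses only the means 102 and 98 and the common standard deviation 5 from the example; the explicit value of $K_n$ is our own evaluation and is not stated in the text.

```latex
\begin{align*}
\ln\Lambda
  &= -\frac{1}{50}\sum_{i=1}^{n}\left[(X_i-102)^2-(X_i-98)^2\right] \\
  &= -\frac{1}{50}\sum_{i=1}^{n}\left[-8X_i + 102^2 - 98^2\right] \\
  &= \frac{8}{50}\sum_{i=1}^{n}X_i - \frac{800\,n}{50}
   = \frac{4n}{25}\,\hat{\mu}_X(n) - 16n ,
\end{align*}
so that $\Lambda = K_n \exp\left(\tfrac{4n}{25}\hat{\mu}_X(n)\right)$ with $K_n = e^{-16n}$.
```

Because the exponent increases with $\hat{\mu}_X(n)$, large sample means favor $H_1$ (mean 102) and small ones favor $H_2$ (mean 98).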


Since the natural logarithm of $\Lambda$ ($\ln \Lambda$) is an increasing function of $\Lambda$ (Figure 7.2-1), we can
simplify the decision function using (natural) logs and aggregating various constants into a
single one. Then the test becomes

if $\hat{\mu}_X(n) > c_n$, accept $H_1$
if $\hat{\mu}_X(n) < c_n$, accept $H_2$,

where $c_n$ is another constant that depends on the number of children in the test $n$, and is
determined by the criterion we impose. If $H_1$ is true then $\hat{\mu}_X(n)$ is $N(102, 25/n)$, that is,


†The Normal characteristic is taken to be valid around the center of the pdf, say, a few $\sigma$ values on either
side of the mean. It definitely is not valid in the tails. For example, what would you make of a “negative
weight”?

Figure 7.2-1 The natural logarithm of x is an increasing function of x.


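$$f_{\hat{\mu}}(x; H_1) = \frac{1}{\sqrt{50\pi/n}}\exp\left(-\frac{1}{2}\left[\frac{x - 102}{5/\sqrt{n}}\right]^2\right),$$

while if $H_2$ is true $\hat{\mu}_X(n)$ is $N(98, 25/n)$, that is,

$$f_{\hat{\mu}}(x; H_2) = \frac{1}{\sqrt{50\pi/n}}\exp\left(-\frac{1}{2}\left[\frac{x - 98}{5/\sqrt{n}}\right]^2\right).$$

The pdf's of $\hat{\mu}_X(n)$ under $H_1$ and $H_2$ are shown in Figure 7.2-2.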


Suppose by way of a criterion we specify $\alpha = 0.025$. Recall that $\alpha \triangleq P[\text{accept that } H_2 \text{ is
true} \mid H_1 \text{ is true}]$. Then

$$0.025 = \int_{-\infty}^{c_n} f_{\hat{\mu}}(x; H_1)\,dx = F_{\hat{\mu}}(c_n) = F_{SN}\left(\frac{c_n - 102}{5/\sqrt{n}}\right) = F_{SN}(z_{0.025}),$$

which, from the Normal tables and simplifying, gives a threshold value $c_n = 102 - (9.8/\sqrt{n})$.
As elsewhere the symbol $F_{SN}(z)$ stands for the CDF of the standard Normal RV evaluated
at $z$. Thus, acceptance of $H_1$ requires the event $\{102 - (9.8/\sqrt{n}) < \hat{\mu}_X(n) < \infty\}$. This
can also be written as $(102 - (9.8/\sqrt{n}), \infty)$ since intervals on the real line are events under
RV mappings. The influence of the sample size on the threshold is shown in Figure 7.2-3.
The power of the test increases with increasing sample size, as shown in Figure 7.2-4.
Increasing power means that the probability of making an error when $H_2$ is true decreases.


The error probability $\alpha$ is called the probability of a type I error and the significance
level† of the test. The probability $P \triangleq 1 - \beta$ is called the power of the test and $\beta$ itself is
called the probability of a type II error. The power of the test is the probability that we
reject the null hypothesis given that the alternative is true. In general, it is not possible to
make both $\alpha$ and $\beta$ extremely small even though it is not true, in general, that $\alpha + \beta = 1$.

†The error probability $\alpha$ is sometimes called the size of the test.

Figure 7.2-2 The pdf's $f_{\hat{\mu}}(x; H_1)$ and $f_{\hat{\mu}}(x; H_2)$ for Example 7.2-1.

Figure 7.2-3 Threshold versus sample size for $\alpha = 0.025$. As the sample size increases, the threshold
value moves to the right.

With reference to Example 7.2-1 we address a question some readers might have 
regarding this discussion, namely since the children eating the weight-controlling snack bar 
average 4 lbs less weight than their counterparts, why not simply accept this as evidence 
that the weight-controlling snack bar works? This would ignore the fact that even in the
heavier group of children, a weight of 98 lbs is within one standard deviation from the mean
of 102 lbs, meaning that if the sample size is small we could be in error in concluding that
the snack bar is useful. Moreover, such a naive approach would tell us nothing about the 
probability that we are mistaken. 
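
To make the dependence on sample size concrete, the sketch below tabulates the threshold $c_n$ and the power of the test in Example 7.2-1 for several values of $n$. It assumes the Normal model of the example (means 102 and 98, standard deviation 5, $\alpha = 0.025$); the function names are our own.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal RV, CDF written F_SN in the text


def threshold(n, mu1=102.0, sigma=5.0, alpha=0.025):
    """Return c_n such that P[sample mean < c_n | H1] = alpha."""
    return mu1 + STD_NORMAL.inv_cdf(alpha) * sigma / sqrt(n)


def power(n, mu1=102.0, mu2=98.0, sigma=5.0, alpha=0.025):
    """Return P[sample mean < c_n | H2], that is, 1 - beta."""
    c_n = threshold(n, mu1, sigma, alpha)
    return NormalDist(mu2, sigma / sqrt(n)).cdf(c_n)


for n in (1, 5, 10, 25, 50, 100):
    print(f"n = {n:3d}   c_n = {threshold(n):7.3f}   power = {power(n):.4f}")
```

The printed values follow the trends of Figures 7.2-3 and 7.2-4: the threshold moves toward 102 and the power approaches 1 as $n$ grows.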


Figure 7.2-4 Power of LRT for $\alpha = 0.025$ (power versus sample size). The power of the test increases
with increasing sample size, which is a good thing. The best test would maximize the power of the test
for a given $n$ and $\alpha$. In this example, the test is indeed the best test.

Example 7.2-2
(difference of means of Normal populations) In some nutritional circles, there is a belief that
bringing aid to Third-World malnourished children by way of a diet rich in omega-3 oils (e.g.,
fish) and complex carbohydrates (whole wheat, bran, brown rice, etc.) can increase a child's
IQ by 10 points by age 13, besides improving health. To test such a claim, one might want to
measure the IQ of children brought up on such a diet against the IQ of children brought up on
the local diet. Typically the data would be the sample mean $\hat{\mu}(n)$ of the IQs of the $n$ children
fed the experimental diet. If we denote the true but unknown mean by $\mu_{IQ}$ the test might take
the form $H_1: \mu_{IQ} = 110$ versus $H_2: \mu_{IQ} = 100$. There are several variations on this type of
test, for example, $H_1: \mu = a$ versus $H_2: \mu \neq a$ and $H_1: a < \mu < b$ versus $H_2: \mu < a,\ \mu > b$.
We consider the elementary test $H_1: \mu = b$ versus $H_2: \mu = a$ $(b > a)$ for a Normal
population with, say, variance $\sigma^2$. We assume a random sample of size $n$, meaning that we
have $n$ i.i.d. RVs $X_1,\ldots,X_n$. Then if $H_1$ is true $X_i : N(b, \sigma^2)$ while if $H_2$ is true $X_i : N(a, \sigma^2)$.
The LRT random variable is


$$\Lambda = \frac{(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left[\frac{X_i - b}{\sigma}\right]^2\right)}{(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left[\frac{X_i - a}{\sigma}\right]^2\right)}, \tag{7.2-3}$$

which, after simplifying, taking logs, and aggregating constants, yields the test

$\hat{\mu}(n) > c_n$, accept $H_1$ (reject $H_2$)
$\hat{\mu}(n) < c_n$, reject $H_1$ (accept $H_2$).

The constant $c_n$ is determined by our choice of $\alpha$. For example with $\alpha = 0.025$, we must
solve

$$\alpha = P[\text{accept } H_2 \mid H_1 \text{ true}] = 0.025$$
$$= \int_{-\infty}^{c_n} (2\pi\sigma^2/n)^{-1/2}\exp\left(-\frac{1}{2}\left(\frac{x - b}{\sigma/\sqrt{n}}\right)^2\right)dx$$
$$= \int_{-\infty}^{c_n'} (2\pi)^{-1/2}\exp\left(-\tfrac{1}{2}y^2\right)dy = F_{SN}(c_n') = F_{SN}(z_{0.025}),$$

where $c_n' \triangleq (c_n - b)\sqrt{n}/\sigma$. From the tables of the standard Normal CDF, we find that
$z_{0.025} = -1.96$. Solving, we get $c_n = b - (1.96\sigma/\sqrt{n})$. Notice the similarity between this
example and Example 7.2-1. The power of the test is

$$P = 1 - P[\text{accept } H_1 \mid H_2 \text{ true}]$$
$$= 1 - (2\pi\sigma^2/n)^{-1/2}\int_{c_n}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{x - a}{\sigma/\sqrt{n}}\right)^2\right)dx \tag{7.2-4}$$
$$= F_{SN}\left(\frac{b - a - (1.96\sigma/\sqrt{n})}{\sigma/\sqrt{n}}\right).$$

The reader will recognize that the power of the test is simply the probability of accepting
$H_2$ when $H_2$ is true. Returning to the IQ problem that motivated this discussion, we find
that for $\alpha = 0.025$, $b = 110$, $a = 100$, $\sigma = 10$, and $n = 25$, the acceptance region for $H_1$ is
the region to the right of $c_n = 106.1$. In other words, when the event $\{106.1 < \hat{\mu}(n) < \infty\}$
occurs, it suggests that a good diet helps to overcome the IQ deficiency of malnourished
children. The power of the test is 0.999.
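
The numbers quoted for the IQ example can be checked by simulation. The following Monte Carlo sketch draws repeated samples of size $n = 25$ under each hypothesis and counts how often the sample mean falls below $c_n$. The number of trials, the seed, and the function name are arbitrary choices of ours; the estimates are therefore only approximate.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_lrt(a=100.0, b=110.0, sigma=10.0, n=25, alpha=0.025,
                 trials=4000, seed=1):
    """Estimate the type I error rate and the power of the test by simulation."""
    rng = random.Random(seed)
    # Threshold from the text: c_n = b + z_alpha * sigma / sqrt(n), z_0.025 = -1.96.
    c_n = b + NormalDist().inv_cdf(alpha) * sigma / sqrt(n)
    type_one = 0  # sample mean below c_n although H1 (mean b) is true
    detected = 0  # sample mean below c_n when H2 (mean a) is true
    for _ in range(trials):
        if fmean(rng.gauss(b, sigma) for _ in range(n)) < c_n:
            type_one += 1
        if fmean(rng.gauss(a, sigma) for _ in range(n)) < c_n:
            detected += 1
    return c_n, type_one / trials, detected / trials


c_n, alpha_hat, power_hat = simulate_lrt()
print(f"c_n = {c_n:.2f}, estimated alpha = {alpha_hat:.4f}, estimated power = {power_hat:.4f}")
```

With more trials the estimated $\alpha$ should settle near 0.025 and the estimated power near 0.999, in line with Equation 7.2-4.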


Neyman-Pearson Theorem. Suppose we are asked to find a test for a simple hypothesis
versus a simple alternative that, for a given $\alpha$, will minimize $\beta$. Such a test will maximize
the power $P = 1 - \beta$ and is therefore a most powerful test. What is this test? The
Neyman-Pearson theorem (given here without proof) furnishes the answer.

Theorem 7.2-1 Denote the set of points in the critical region by $R_k$ (i.e., the region
of outcomes where we reject the hypothesis $H_1$). Denote the significance of the test as $\alpha$,
meaning $P[\text{accept } H_2 \mid H_1 \text{ is true}] \le \alpha$. Then $R_k$ maximizes the power of the test $P \triangleq 1 - \beta$
if it satisfies

$$\Lambda \triangleq \frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} < k \tag{7.2-5}$$

for some fixed number $k$, which determines $R_k$. ∎

Discussion. The Neyman-Pearson Theorem (NPT) says that the likelihood ratio test,
subject to the above constraints, that is, at significance $\alpha$, is the most powerful test. In this
sense it is an optimal test. The relationship between $R_k$, $k$, and $\alpha$ is not explicitly stated
by the theorem but becomes clear in working a problem.


Example 7.2-3
(chicken feed for making large eggs) A producer of chicken feed claims that a new product
“Eggrow,” when fed to chickens, will cause the laid eggs to be larger than those laid by
chickens fed ordinary feed. With ordinary feed, the chickens raised by this producer lay
eggs that on the average weigh 60 grams per egg, with a standard deviation of 4 grams.
Twenty-five chickens fed on “Eggrow” produce eggs whose average weight is 62 grams with a
standard deviation of 4 grams. Let the hypothesis be $H_1: \mu = \mu_1 = 62$ and the alternative
be $H_2: \mu = \mu_2 = 60$. The significance level of the test is 0.05. According to the NPT,
the test

$$\Lambda = \frac{\prod_{i=1}^{n}(2\pi 16)^{-1/2}\exp\left(-\frac{1}{2}\left[\frac{X_i - 62}{4}\right]^2\right)}{\prod_{i=1}^{n}(2\pi 16)^{-1/2}\exp\left(-\frac{1}{2}\left[\frac{X_i - 60}{4}\right]^2\right)} < k$$

that defines the critical region $R_k$, is the most powerful test. Then
$\Lambda = \exp\left(\frac{n}{8}\hat{\mu} + \frac{n}{32}\left[(60)^2 - (62)^2\right]\right)$ and taking logs, aggregating constants, and simplifying yields
the test

if $\hat{\mu} < c_n$, reject $H_1$, accept $H_2$
if $\hat{\mu} > c_n$, accept $H_1$, reject $H_2$,

where $c_n$ is an unknown constant. To find $c_n$ and the rejection region $R_k$, we solve

$$0.05 = \int_{-\infty}^{c_n}\frac{1}{\sqrt{2\pi(16/25)}}\exp\left(-\frac{1}{2}\left(\frac{x - 62}{4/5}\right)^2\right)dx$$

and find that $c_n = 60.7$ and $R_k = (0, 60.7)$. Thus, if $\hat{\mu} < 60.7$, reject $H_1$, accept $H_2$. The
test is most powerful and $P \approx 0.81$.








7.3 COMPOSITE HYPOTHESES 


In the previous section we mentioned that in practice there are tests of the form $H_1: a <
\mu < b$ versus $H_2: \mu < a,\ \mu > b$ and others. All of these tests have one thing in common:
either $H_1$ or $H_2$ or both deal with events whose sample space has many outcomes. In the
case of the simple hypothesis versus the simple alternative, the sample space had only two
points $\zeta_1$ and $\zeta_2$. In the case of composite hypotheses, the test

$$\Lambda \triangleq \frac{f(X_1;\zeta_1)\cdots f(X_n;\zeta_1)}{f(X_1;\zeta_2)\cdots f(X_n;\zeta_2)} \tag{7.3-1}$$

has no meaning because there are many more $\zeta$'s than just $\zeta_1$ and $\zeta_2$. To understand the
material in this section, the reader should recall that in the estimation of parameters by the
maximum likelihood method (MLM) the idea was to find the parameter $\theta$ in the likelihood
function $L(\theta)$ that was most likely to have yielded the observed result. Often this could
be found by differentiation but not always. In the problems discussed so far, there was no
need to maximize the likelihood function to find the most likely $\theta$ because $\theta$ had one of
two values, either $\zeta_1$ or $\zeta_2$. Suppose that the parameter of interest is the mean, that is,
$\theta = \mu$. Then in a problem such as $H_1: \mu = \mu_0$ versus $H_2: \mu \neq \mu_0$ the maximization of the
likelihood function associated with $H_2$ requires searching for the optimum value of $\mu$ in the
parameter space $(-\infty, \infty)$. In other words, while the hypothesis in this case is simple, the
alternative is not: It is said to be composite.

Fortunately not all composite hypothesis problems require such a search. We can still
use the Neyman-Pearson rule and its desirable most-powerful property. We illustrate with
an example.

Example 7.3-1
(testing the hypothesis $H_1: \mu = \mu_1$ versus the alternative $H_2: \mu < \mu_1$) We assume a Normal
population with mean $\mu$ and variance $\sigma^2$. At first glance it would seem that the likelihood
function associated with $H_2: \mu < \mu_1$ requires a search. However, we can reduce this
problem to a simple hypothesis versus a simple alternative by a slight modification of the
$H_2$ hypothesis. That is, we modify the problem to $H_1: \mu = \mu_1$ versus $H_2: \mu = \mu_2 < \mu_1$,
where $\mu_2$ is as yet arbitrary. Then

$$\Lambda = \exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n}(X_i - \mu_1)^2 - \sum_{i=1}^{n}(X_i - \mu_2)^2\right)\right) < k \tag{7.3-2}$$

is the LRT for the critical region for $H_1$. Simplifying, taking logs, and aggregating all
constants, we obtain the test: if $\hat{\mu} < c_n$ reject $H_1$. To find the constant $c_n$ we proceed as
before; that is, we use the type I error criterion, that is, the significance level of the test.
Thus, say, with $\alpha = 0.01$ and the pdf $f_{\hat{\mu}}(x; \mu_1) = N(\mu_1, \sigma^2/n)$ we solve

$$0.01 = (2\pi\sigma^2/n)^{-1/2}\int_{-\infty}^{c_n}\exp\left[-0.5\left(\frac{x - \mu_1}{\sigma/\sqrt{n}}\right)^2\right]dx$$

to obtain $c_n = \mu_1 - 2.32\sigma/\sqrt{n}$. Thus, we reject $H_1$ if $\hat{\mu} < \mu_1 - 2.32\sigma/\sqrt{n}$. Note that we
never had to specify an actual value for $\mu_2$.











Generalized Likelihood Ratio Test (GLRT) 


The GLRT is useful for solving composite hypotheses problems. First, recall that some
likelihood functions are functions of one parameter, some of two parameters, etc. For
example, the likelihood function associated with an n-sample of i.i.d. exponential RVs is
$L(\lambda) = \lambda^n \exp\left(-\lambda\sum_{i=1}^{n} X_i\right)\prod_{i=1}^{n} u(X_i)$† and is a function only of the parameter $\theta = \lambda$, while the
likelihood function associated with an n-sample of i.i.d. Normal RVs is

$$L(\mu,\sigma) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}\left[\frac{X_i - \mu}{\sigma}\right]^2\right)$$

and is a function of two parameters $\theta = (\mu,\sigma)$. The likelihood function associated with
a two-dimensional (multivariate) Normal would be a function of five parameters, that
is, $\mu_1, \mu_2, \sigma_1, \sigma_2, \rho_{12}$. We use the notation $L(\theta)$ to indicate a likelihood function of the
parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$. Now consider the following problem: Let $\Theta$ denote the
global k-dimensional parameter space; for example, in the univariate Normal this would
be $\Theta = (-\infty < \mu < \infty,\ 0 < \sigma < \infty)$. Let $\Theta_1$ denote the parameter space (a subspace
of $\Theta$) associated with the hypothesis $H_1$. For example if $X : N(\mu, \sigma^2)$ and the hypothesis
is $H_1: 3 < \mu < 4$, then $\Theta_1 = (3 < \mu < 4,\ 0 < \sigma < \infty)$.

†The function $u(x)$ is the unit step: $u(x) = 1$, $x > 0$, and zero elsewhere.

Define the test statistic $\Lambda$ for testing $H_1: \theta \in \Theta_1$ versus the alternative $H_2: \theta \notin \Theta_1$ as

$$\Lambda \triangleq \frac{L_{LM}(\theta^{\dagger})}{L_{GM}(\theta^{*})}, \tag{7.3-3a}$$

where $L_{LM}(\theta^{\dagger}) \triangleq \max_{\theta \in \Theta_1} L(\theta)$ and $L_{GM}(\theta^{*}) \triangleq \max_{\theta \in \Theta} L(\theta)$. We may ask why $\Lambda$, as
given in Equation 7.3-3a, is a reasonable test statistic for accepting or rejecting $H_1$. First
recall that maximizing the numerator gives us the most likely parameter estimates, restricted
to $\Theta_1$, to account for the observations. Because our search is restricted to $\Theta_1$, the maximum
in this parameter subspace may not be a global maximum; hence we call it a local maximum.
Next, maximizing the denominator gives us the most likely unrestricted parameter estimates
that account for the observations; hence we call it a global maximum. The subscripts LM
and GM are there to remind the reader of the “local-max” and “global-max” operations,
respectively. We observe that $\Lambda$ is a random variable with its realization confined to [0,1].
(Question for the reader: Why is this so?). Now if the realizations of $\Lambda$ are close to one,
then we assume that $H_1$ is true; that is, the unknown parameters are in $\Theta_1$ but, in fact, are
also the most likely parameters in the whole space. On the other hand, if the realizations
of $\Lambda$ are small or close to zero we may assume that the most likely parameters are not
in $\Theta_1$. The threshold value $c$ denotes the point at which we go from accepting (rejecting)
the hypothesis to accepting (rejecting) the alternative $H_2$ and is usually determined by
enforcing the significance level $\alpha$. In summary then, the GLRT is described as

reject $H_1$ if $\Lambda < c$, (7.3-3b)

where $\Lambda$ is given in Equation 7.3-3a. It has been shown that under certain conditions, the
GLRT is asymptotically optimal in the Neyman-Pearson sense. However there exist
counterexamples in the literature that prove that the GLRT is not always optimal [7-22]. In this
sense it must be regarded as being empirical.

We illustrate the application of the GLRT with several examples involving continuous
distributions.

Example 7.3-2
(testing $H_1: \mu = \mu_1$ versus $H_2: \mu \neq \mu_1$ when X is Normal and $\sigma^2$ known) We make $n$
observations on a Normal RV with known variance $\sigma^2$. The likelihood function is

$$L(\mu) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2\right)$$
$$\phantom{L(\mu)} = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[(X_i - \hat{\mu})^2 + (\hat{\mu} - \mu)^2\right]\right). \tag{7.3-4}$$


To go from line 1 to line 2 we generated some cross-terms in the argument of the exponent
that vanish in the summation. We leave the algebraic steps as an exercise to the reader.
Since $\sigma^2$ is specified, the space $\Theta$ is $(-\infty < \mu < \infty)$. Then $L_{GM}(\mu^{*})$ is obtained when
$\mu^{*} = \hat{\mu}$, that is, $L_{GM}(\mu^{*}) = L(\hat{\mu})$, and since $\Theta_1$ contains only one point it follows that
$L_{LM}(\mu^{\dagger}) = L(\mu_1)$. From Equation 7.3-3a we get

$$\Lambda = \frac{L(\mu_1)}{L(\hat{\mu})} = \exp\left(-\frac{n}{2\sigma^2}(\hat{\mu} - \mu_1)^2\right) \tag{7.3-5}$$

and the critical region is associated with outcomes of $\hat{\mu}$ that are far from $\mu_1$. When $\hat{\mu}$ is
near $\mu_1$, $\Lambda$ will take values near 1 and we would tend to accept $H_1$. Likewise when $\hat{\mu}$ is far
from $\mu_1$, it is unlikely that $H_1$ is true and we reject it. Somewhere in between is a constant
$c$ such that $0 < \Lambda < c$ describes the critical region. Taking natural logs, we find that the
critical region is defined by

$$\hat{\mu} > \mu_1 + \left(2\sigma^2\ln(1/c)/n\right)^{1/2}$$
$$\hat{\mu} < \mu_1 - \left(2\sigma^2\ln(1/c)/n\right)^{1/2}, \tag{7.3-6}$$

where $c$ is determined by the significance level $\alpha$ of the test.
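
The algebraic step left as an exercise in Equation 7.3-4 can be sketched as follows. This is our own expansion of that step and relies only on the definition $\hat{\mu} = (1/n)\sum_{i=1}^{n} X_i$.

```latex
\begin{align*}
\sum_{i=1}^{n}(X_i-\mu)^2
  &= \sum_{i=1}^{n}\left[(X_i-\hat{\mu})+(\hat{\mu}-\mu)\right]^2 \\
  &= \sum_{i=1}^{n}(X_i-\hat{\mu})^2
     + 2(\hat{\mu}-\mu)\sum_{i=1}^{n}(X_i-\hat{\mu})
     + n(\hat{\mu}-\mu)^2 \\
  &= \sum_{i=1}^{n}(X_i-\hat{\mu})^2 + n(\hat{\mu}-\mu)^2 ,
\end{align*}
because $\sum_{i=1}^{n}(X_i-\hat{\mu}) = n\hat{\mu} - n\hat{\mu} = 0$.
```

In the ratio $L(\mu_1)/L(\hat{\mu})$ the factor involving $\sum_{i}(X_i - \hat{\mu})^2$ is common to numerator and denominator and cancels, which leaves the exponent $-n(\hat{\mu} - \mu_1)^2/(2\sigma^2)$ of Equation 7.3-5.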


Example 7.3-3 
(numerical realization in Example 7.3-2) Here we obtain a numerical evaluation of Equation
7.3-6. Assume that $\mu_1 = 5$, $\sigma^2 = 4$, $n = 15$, and $\alpha = 0.05$. With $f_{\Lambda}(x; \mu)$ denoting the
pdf of $\Lambda$ we must compute $0.05 = \int_{0}^{c} f_{\Lambda}(x)\,dx$. But the event $\{\Lambda < c\}$ is identical to the
event $\{-\infty < \ln\Lambda < \ln c\}$, which in turn is identical to $\{-2\ln c < -2\ln\Lambda < \infty\}$. From
Equation 7.3-5, $-2\ln\Lambda = \left(\frac{\hat{\mu} - \mu_1}{\sigma/\sqrt{n}}\right)^2$, which is $\chi^2$ with one degree of freedom, that is, $\chi^2_1$
(the subscript indicates the degree of freedom). Denoting the $\chi^2$ pdf by $f_{\chi^2}(x; n)$ we write

$$0.05 = \int_{0}^{c} f_{\Lambda}(x)\,dx = \int_{-2\ln c}^{\infty} f_{\chi^2}(x; 1)\,dx = 1 - F_{\chi^2}(-2\ln c; 1).$$

From the tables of the CDF of the $\chi^2$ RV we obtain $-2\ln c = 3.84$. Hence from Equation
7.3-6 we determine the critical region as $\hat{\mu} > 6.01$, $\hat{\mu} < 3.99$ or, as interval events mapped
by $\hat{\mu}$, $(-\infty, 3.99) \cup (6.01, \infty)$.
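
The table lookup in this example can be reproduced with the Python standard library. The sketch below relies on the fact that a $\chi^2$ RV with one degree of freedom is the square of a standard Normal RV, so its 0.95 quantile equals the square of the Normal 0.975 quantile; the variable names are our own.

```python
from math import exp, sqrt
from statistics import NormalDist

mu1, variance, n, alpha = 5.0, 4.0, 15, 0.05

# (1 - alpha) quantile of chi-square with one degree of freedom; equals -2 ln c.
q = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2
c = exp(-q / 2.0)  # GLRT threshold on Lambda

# Half-width of the acceptance interval: (2 sigma^2 ln(1/c) / n)^(1/2) = (sigma^2 q / n)^(1/2).
half_width = sqrt(variance * q / n)
lower, upper = mu1 - half_width, mu1 + half_width

print(f"-2 ln c = {q:.2f}, c = {c:.3f}")
print(f"reject H1 if the sample mean is below {lower:.2f} or above {upper:.2f}")
```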


Example 7.3-4 
(testing the telephone waiting time when the call is in a queue) A call to the Goldmad
Investment Bank (GIB) gets an automatic (robotic) operator that announces that during
business hours the average waiting time to speak to an investment consultant is less than 30
seconds (0.5 minutes). We wish to test this claim using the GLRT. We make $n$ calls to the
GIB during business hours and record the waiting times $X_i$, $i = 1,\ldots,n$, assumed to be
i.i.d. exponential random variables each with pdf $f_{X_i}(x; \mu) = (1/\mu)\exp(-x/\mu)u(x)$, where
$\mu = E(X_i)$, $i = 1,\ldots,n$. From basic probability we know that $\hat{\mu} = (1/n)\sum_{i=1}^{n} X_i$ is an
unbiased, consistent estimator for $\mu$. We test the hypothesis $H_1: \mu \le 0.5$ versus $H_2: \mu > 0.5$.

The likelihood function is $L(\mu) = (1/\mu)^n \exp\left(-\frac{1}{\mu}\sum_{i=1}^{n} X_i\right) = (1/\mu^n)\exp(-n\hat{\mu}/\mu)$.

Then $L_{GM}(\mu^{*})$ is obtained by differentiation with respect to $\mu$ to obtain

$$L_{GM}(\mu^{*}) = L(\hat{\mu}) = \hat{\mu}^{-n}\exp(-n).$$


Finding $L_{LM}(\mu^{\dagger})$ is a little more sophisticated. To illustrate what is going on we plot two
likelihood functions in Figure 7.3-1, one that peaks at the mean of 0.45 and another that
peaks at the mean of 0.55. The $\mu$ space $\Theta_1 = (0, 0.5]$ is based on our hypothesis that $\mu \le 0.5$
and includes the global maximum point 0.45. However, when the likelihood function is the
lower curve in Figure 7.3-1, which peaks at $\mu = 0.55$, the local maximum is not the same
as the global maximum since the point 0.55 is not in $\Theta_1 = (0, 0.5]$.
Hence

$$L_{LM}(\mu^{\dagger}) = \begin{cases}\hat{\mu}^{-n}\exp(-n), & \hat{\mu} \le 0.5\\ 2^{n}\exp(-2n\hat{\mu}), & \hat{\mu} > 0.5.\end{cases}$$

Figure 7.3-1 Likelihood functions for different means. The upper curve is the likelihood function when
the true mean is at 0.45 and n = 10. The lower curve is the likelihood function when the true mean is at
0.55 and n = 10. The subspace $\Theta_1 = (0, 0.5]$ includes the point 0.45 (shown as the dotted line on the left
of the solid line) but not the point 0.55 (shown as the dotted line on the right of the solid line).

Finally, from Equation 7.3-3a, we get 


E l 1, <5 (7.3-7) 
| (2)" exp (-nf[2å — 1]), à > 0.5 

The critical region is the interval (0,c’); that is, all outcomes A € (0,c’) would lead to the 
rejection of Hı. The critical region is shown in Figure 7.3-2: On the A axis it is below the 
horizontal line atc’; on the j axis it is to the right of & = c. 

Because the likelihood ratio decreases monotonically with $\hat{\mu}$ in the region $\hat{\mu} > 0.5$
(Figure 7.3-2), we can use $\hat{\mu}$ as a test statistic. Assuming that n is large enough for the
Normal approximation to apply to the behavior of $\hat{\mu}$, at least where the pdf has significant
value, that is, within a few sigmas around its mean, we have $\hat{\mu} : N(\mu, \mu^2/n)$ since
$\hat{\mu}$ is an unbiased estimator for $\mu$. In writing this result we recalled that the variance of
a single exponential RV is $\mu^2$ and therefore the variance of $\hat{\mu}$ is $\sigma_{\hat{\mu}}^2 = \mu^2/n$. We create
the approximate standard Normal RV from $Z \triangleq (\hat{\mu} - \mu)\sqrt{n}/\mu$ and compute $c$ from the

[Figure 7.3-2 plot, "Likelihood ratio test": the GLR statistic $\Lambda$ versus $\hat{\mu}$, with the points 0.5 and $c$ marked on the $\hat{\mu}$ axis.]

Figure 7.3-2 Variation of the GLR test statistic with the sample mean estimator.

[Figure 7.3-3 plot, "Migration of cut-off point with significance level": cut-off point $c$ (0 to 1) versus significance level (0 to 0.4).]

Figure 7.3-3 As $\alpha$ increases, the cut-off point decreases, thereby increasing the width of the critical
region.


significance constraint $\alpha$. Using the percentile notation $1 - \alpha = F_{SN}(z_{1-\alpha})$, we find that
$c = \mu + z_{1-\alpha}\,\mu/\sqrt{n}$, from which we see that the critical point $c$ increases linearly with $\mu$.
We reject the hypothesis when $\hat{\mu} > c$. For example, with $\mu = 0.5$, $\alpha = 0.05$, and n = 10 we
find that $z_{0.95} = 1.64$ and $c = 0.76$. As $\alpha$ increases, the cut-off point decreases toward 0.5
(Figure 7.3-3).
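
The cut-off computation above can be sketched in a few lines of Python. This is only an illustration under the Normal approximation used in the text; the function name `cutoff` is our own, and the percentile is taken from the standard library rather than from tables.

```python
from math import sqrt
from statistics import NormalDist


def cutoff(mu0: float, alpha: float, n: int) -> float:
    """Critical point c = mu0 + z_{1-alpha} * mu0 / sqrt(n) (Normal approximation)."""
    z = NormalDist().inv_cdf(1.0 - alpha)  # standard Normal percentile z_{1-alpha}
    return mu0 + z * mu0 / sqrt(n)


# Values from the text: mu = 0.5, alpha = 0.05, n = 10.
c = cutoff(0.5, 0.05, 10)
print(round(c, 2))  # the text reports c = 0.76

# The cut-off migrates toward 0.5 as the significance level grows.
for alpha in (0.05, 0.1, 0.2, 0.4):
    print(alpha, round(cutoff(0.5, alpha, 10), 3))
```

The loop reproduces the trend of Figure 7.3-3: a larger $\alpha$ gives a smaller $z_{1-\alpha}$ and hence a smaller cut-off.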


Example 7.3-5 
(evaluation of cancer treatment by the drug Herceptin) Newer treatments for cancer involve
disabling the proteins that fuel cancer. For example, some breast cancers contain a protein
called HER2. In such cases, the drug Herceptin is partially effective in treating the cancer
in that it reduces the cancer recurrence by 50 percent.† Tumors that do not exhibit HER2
have better prognoses than those that do. Since Herceptin has significant toxic side effects,
it is important that the test for the HER2 protein be accurate, but this is not always the
case. Let $H_1$: tumor has a high level of HER2 and therefore will respond to Herceptin, and
let $H_2$: tumor has low levels (or none at all) of HER2 and therefore the patient should not
be given Herceptin. It is estimated that in current testing for HER2:

P[decide $H_1$ is true | $H_2$ is true] = 0.2

P[decide $H_2$ is true | $H_1$ is true] = 0.1.

Hence, the tests have a significance level of 0.1 (the probability of rejecting $H_1$ when it is
true) and a power of 1 − 0.2 = 0.8 (the probability of rejecting $H_1$ when $H_2$ is true).

† "Cancer Fight: Unclear Tests for New Drug," New York Times, April 20, 2010.





How Do We Test for the Equality of Means of Two Populations? 


Assume that there is a drug being tested for androgen-independent† prostate cancer. The
drug is administered to a group of men with advanced prostate cancer. Does the drug extend 
the lives of the participants compared with those of men taking the traditional therapy? 
A printing company is evaluating two types of paper for use in its presses. Is one type of 
paper less likely to jam the presses than the other? The Department of Transportation is 
considering buying concrete from two different sources. Is one more resistant to potholes 
than the other is? Some of these problems fall within the following framework. We have two 
populations, assumed Normal, and we have m samples from population P1 and n samples 
from population P2. Is the mean of population P1 equal to the mean of population P2? In 
general, this is a difficult problem, essentially beyond the scope of the discussion treated 
in this chapter. More discussion on this problem is given in [7-1]. However, when one can 
assume that the variance of the populations is the same, the problem is treatable analytically 
in a straightforward way. In preparation for discussing this problem, we review some related 
material in Example 7.3-6. 


Example 7.3-6 
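(preliminary results for Example 7.3-7) We have samples from two Normal populations
$S_1 = \{X_{1i}, i = 1, \ldots, m\}$ and $S_2 = \{X_{2i}, i = 1, \ldots, n\}$. The elements of $S_1$ are m i.i.d.
observations on $X_1$ with $X_1 : N(\mu_1, \sigma_1^2)$. Likewise, the elements of $S_2$ are n i.i.d. observations
on $X_2$ with $X_2 : N(\mu_2, \sigma_2^2)$. Further, assume that $E[(X_{1i} - \mu_1)(X_{2j} - \mu_2)] = 0$, all $i, j$.

(i) Assuming $\mu_1 = \mu_2 = \mu$, show that $E[\hat{\mu}_1 - \hat{\mu}_2] = 0$.
Solution to (i) $E[\hat{\mu}_1 - \hat{\mu}_2] = E[\hat{\mu}_1] - E[\hat{\mu}_2] = \mu - \mu = 0$.

(ii) Assume that $\mu_1 = \mu_2 = \mu$ and $\sigma_1 = \sigma_2 = \sigma$. Show that $\mathrm{Var}(\hat{\mu}_1 - \hat{\mu}_2) = (m^{-1} + n^{-1})\sigma^2$.
Solution to (ii) Since $E[\hat{\mu}_1] = E[\hat{\mu}_2] = \mu$, $\mathrm{Var}(\hat{\mu}_1 - \hat{\mu}_2) = E[(\hat{\mu}_1 - \hat{\mu}_2)^2] = E[\hat{\mu}_1^2] + E[\hat{\mu}_2^2] - 2E[\hat{\mu}_1 \hat{\mu}_2]$. Substitute

$$E[\hat{\mu}_1^2] = m^{-2} \left( \sum_{i=1}^{m} E[X_{1i}^2] + \sum_{i=1}^{m} \sum_{j \ne i} E[X_{1i} X_{1j}] \right)$$

$$E[\hat{\mu}_2^2] = n^{-2} \left( \sum_{i=1}^{n} E[X_{2i}^2] + \sum_{i=1}^{n} \sum_{j \ne i} E[X_{2i} X_{2j}] \right)$$

$$E[X_{1i}^2] = E[X_{2i}^2] = \sigma^2 + \mu^2$$

$$E[\hat{\mu}_1 \hat{\mu}_2] = \mu^2$$

into the expression for the variance and obtain the required result.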
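
The substitution in part (ii) can be written out step by step. The sketch below assumes only what the example states (i.i.d. observations within each sample and uncorrelated samples), so that $E[X_{1i} X_{1j}] = E[X_{2i} X_{2j}] = \mu^2$ for $i \ne j$:

```latex
\begin{align*}
E[\hat{\mu}_1^2] &= m^{-2}\left[ m(\sigma^2 + \mu^2) + m(m-1)\mu^2 \right] = \frac{\sigma^2}{m} + \mu^2, \\
E[\hat{\mu}_2^2] &= n^{-2}\left[ n(\sigma^2 + \mu^2) + n(n-1)\mu^2 \right] = \frac{\sigma^2}{n} + \mu^2, \\
\mathrm{Var}(\hat{\mu}_1 - \hat{\mu}_2) &= \left(\frac{\sigma^2}{m} + \mu^2\right) + \left(\frac{\sigma^2}{n} + \mu^2\right) - 2\mu^2
  = \left(m^{-1} + n^{-1}\right)\sigma^2 .
\end{align*}
```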


† Androgen-independent means that the cancer is not fueled by testosterone. It is difficult to treat. The
authors' colleague, Prof. Nick Galatsanos, an important contributor to the science of image processing, died
from this illness at the age of 52.

(iii) Show that if V and W are independent Chi-square RVs with degrees of freedom (DOF) m and n,
respectively, then $U \triangleq V + W$ is Chi-square with DOF m + n.
Solution to (iii) If $V : \chi_m^2$ and $W : \chi_n^2$, then $V = \sum_{i=1}^{m} Y_i^2$ and $W = \sum_{i=1}^{n} Z_i^2$, where we
can assume that $Y_i, i = 1, \ldots, m$, are i.i.d. N(0,1) and $Z_i, i = 1, \ldots, n$, are i.i.d. N(0,1).
The MGF of V is $M_V(t) = E[\exp(tV)]$ and is computed as

$$\begin{aligned}
M_V(t) &= (2\pi)^{-m/2} \int \cdots \int \exp\left(t \sum_{i=1}^{m} y_i^2\right) \exp\left(-\tfrac{1}{2} \sum_{i=1}^{m} y_i^2\right) \prod_{i=1}^{m} dy_i \\
&= \prod_{i=1}^{m} (2\pi)^{-1/2} \int \exp\left(-0.5(1 - 2t) y_i^2\right) dy_i \\
&= (1 - 2t)^{-m/2} \quad \text{for } t < 1/2.
\end{aligned}$$

Line 1 is by definition; line 2 is by the i.i.d. assumption on the $Y_i$'s; and line 3 results from the
total area under the Normal curve being unity. Because U = V + W and V and W are jointly
independent, it follows from the discussion in Section 4.4 that $M_U(t) = M_V(t) M_W(t)$. Since
$M_V(t) = (1 - 2t)^{-m/2}$ and $M_W(t) = (1 - 2t)^{-n/2}$, it follows that $M_U(t) = (1 - 2t)^{-(m+n)/2}$,
which implies that $U : \chi_{m+n}^2$.


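(iv) Given the likelihood function $L = (2\pi\sigma^2)^{-m/2} \exp\left[-0.5 \sum_{i=1}^{m} ((X_i - \mu)^2/\sigma^2)\right]$, show
that $L_{GM} = L(\mu^{\dagger}, \sigma^{2\dagger})$, where, in this case, $\mu^{\dagger} = \hat{\mu}$ and $\sigma^{2\dagger} = \hat{\sigma}^2$.
Solution to (iv) We obtain $\mu^{\dagger}$ by differentiating $\ln L$ with respect to $\mu$ and obtain
$\mu^{\dagger} = \hat{\mu} \triangleq m^{-1} \sum_{i=1}^{m} X_i$. Likewise, we obtain $\sigma^{2\dagger}$ by differentiating with respect to $\sigma^2$ and obtain
$\sigma^{2\dagger} = \hat{\sigma}^2 \triangleq m^{-1} \sum_{i=1}^{m} (X_i - \hat{\mu})^2$. Substituting into the expression for L we compute $L_{GM}$ as

$$L_{GM} = \left( \frac{m}{2\pi \sum_{i=1}^{m} (X_i - \hat{\mu})^2} \right)^{m/2} e^{-m/2}. \tag{7.3-8}$$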


Example 7.3-7 
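(testing $H_1 : \mu_1 = \mu_2$ versus $H_2 : \mu_1 \ne \mu_2$, $\sigma_1^2 = \sigma_2^2 = \sigma^2$ not known) As in Example
7.3-6 we have samples from two Normal populations $S_1 = \{X_{1i}, i = 1, \ldots, m\}$ and $S_2 =
\{X_{2i}, i = 1, \ldots, n\}$. The elements of $S_1$ are m i.i.d. observations on $X_1$ with $X_1 : N(\mu_1, \sigma^2)$.
Likewise, the elements of $S_2$ are n i.i.d. observations on $X_2$ with $X_2 : N(\mu_2, \sigma^2)$. Further,
assume that $E[(X_{1i} - \mu_1)(X_{2j} - \mu_2)] = 0$, for all $i, j$. We shall test $H_1 : \mu_1 = \mu_2$ versus
$H_2 : \mu_1 \ne \mu_2$. The parameter space† for $H_1$ is $\Theta_1 = (\mu, \sigma^2)$ while the global parameter
space is $\Theta = (\mu_1, \mu_2, \sigma^2)$. The likelihood function is

$$L = \left( \frac{1}{2\pi\sigma^2} \right)^{(m+n)/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{m} \left( \frac{X_{1i} - \mu_1}{\sigma} \right)^2 \right) \times \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{X_{2i} - \mu_2}{\sigma} \right)^2 \right) \tag{7.3-9}$$

† To avoid excessive notation we denote a parameter space such as $\Theta = \{-\infty < \mu_1 < \infty, -\infty < \mu_2 <
\infty, \sigma_1 > 0, \sigma_2 > 0\}$ by $\Theta = \{\mu_1, \mu_2, \sigma_1, \sigma_2\}$, etc. for other cases, when there is no danger of confusion. Then
the expression $L(\Theta)$ can be interpreted as the likelihood function of parameters in the space $\Theta$.

and from the results of (iv) in Example 7.3-6 we obtain

$$\mu_1^{\dagger} = m^{-1} \sum_{i=1}^{m} X_{1i} = \hat{\mu}_1$$

$$\mu_2^{\dagger} = n^{-1} \sum_{i=1}^{n} X_{2i} = \hat{\mu}_2$$

$$\sigma^{2\dagger} = \frac{1}{m+n} \left( \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 \right) = \hat{\sigma}^2.$$

We insert these results in L for $\mu_1$, $\mu_2$, and $\sigma^2$, respectively, to obtain

$$L_{GM} = \left( \frac{m+n}{2\pi \left( \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 \right)} \right)^{(m+n)/2} \exp\left( -\frac{m+n}{2} \right). \tag{7.3-10}$$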


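Returning now to the likelihood function in Equation 7.3-9, we wish to maximize this in
the parameter subspace $\Theta_1$. Since in $H_1$, $\mu_1 = \mu_2 = \mu$, we rewrite L as

$$L(\mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{(m+n)/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{m} \left( \frac{X_{1i} - \mu}{\sigma} \right)^2 \right) \times \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{X_{2i} - \mu}{\sigma} \right)^2 \right). \tag{7.3-11}$$

Straightforward differentiation with respect to $\mu$ and $\sigma^2$ yields $\mu^{*}$ and $\sigma^{2*}$ as

$$\mu^{*} = \frac{1}{m+n} \left( \sum_{i=1}^{m} X_{1i} + \sum_{i=1}^{n} X_{2i} \right) = \frac{m}{m+n} \hat{\mu}_1 + \frac{n}{m+n} \hat{\mu}_2$$

and

$$\sigma^{2*} = \frac{1}{m+n} \left( \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 + \frac{mn}{m+n} (\hat{\mu}_1 - \hat{\mu}_2)^2 \right).$$

When $\mu^{*}$ and $\sigma^{2*}$ are substituted for $\mu$ and $\sigma^2$ in $L(\Theta_1)$ of Equation 7.3-11, we obtain $L_{LM}$
as

$$L_{LM} = \left( \frac{m+n}{2\pi \left( \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 + \frac{mn}{m+n} (\hat{\mu}_1 - \hat{\mu}_2)^2 \right)} \right)^{(m+n)/2} e^{-(m+n)/2}.$$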
The likelihood ratio $\Lambda \triangleq L_{LM}/L_{GM}$ is computed as

$$\Lambda = \left[ 1 + \frac{\frac{mn}{m+n} (\hat{\mu}_1 - \hat{\mu}_2)^2}{\sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2} \right]^{-(m+n)/2}. \tag{7.3-12}$$

From (ii) in Example 7.3-6, $\hat{\mu}_1 - \hat{\mu}_2$ is distributed as $N(0, \sigma^2(m+n)/mn)$, so that

$$Z \triangleq \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sigma \sqrt{(m+n)/mn}}$$

Figure 7.3-4 Critical region shown in heavy lines. It is easier to test $H_1$ versus $H_2$ using a test on
$T_{m+n-2}^2$ than on $\Lambda$.

[Figure 7.3-5 plot, "$\Lambda$ versus T": horizontal axis from −4 to 4 with the points $-t_c$, 0, and $t_c$ marked.]

Figure 7.3-5 Instead of doing the test on the GLR statistic, it is more convenient to do the test on the
T statistic. See Equation 7.3-13. The critical region along the T-axis is shown in heavy lines. For the
reader's interest, for this graph m = n = 10. The hypothesis is rejected if $|T| > t_c$, where $t_c$ depends
on the type I error $\alpha$. In a two-sided test at significance $\alpha$ we assign $\alpha/2$ error mass to each half of the
critical region, that is, $P[T > t_c] = \alpha/2$ and $P[T < -t_c] = \alpha/2$.


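is distributed as N(0, 1). Likewise,

$$W_{m+n-2} \triangleq \sum_{i=1}^{m} \left( \frac{X_{1i} - \hat{\mu}_1}{\sigma} \right)^2 + \sum_{i=1}^{n} \left( \frac{X_{2i} - \hat{\mu}_2}{\sigma} \right)^2$$

is Chi-square with DOF m + n − 2 by (iii) of Example 7.3-6. Finally, recall that

$$T_{m+n-2} \triangleq \frac{Z}{\sqrt{W_{m+n-2}/(m+n-2)}}$$

is the t-distributed RV with DOF m + n − 2 so that

$$\Lambda = \left[ 1 + \frac{T_{m+n-2}^2}{m+n-2} \right]^{-(m+n)/2}. \tag{7.3-13}$$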


Since $\Lambda$ is a monotonically decreasing function of $T_{m+n-2}^2$, the test can be made on $T_{m+n-2}^2$
rather than on $\Lambda$. Then the critical region for $H_1$ of the form $0 < \Lambda < \lambda_c$ translates, when
the test is done on $T_{m+n-2}^2$, as the critical region $(t_c^2, \infty)$ (Figure 7.3-4) or, equivalently, as
the union of the events $(t_c, \infty)$ and $(-\infty, -t_c)$ (Figure 7.3-5). More information on this type
of test, the so-called t-test, can be found in [7-10] to [7-15] and/or on the Internet by entering
t-test in Google™ or another search engine.

Under the constraint of a type I error $\alpha$ we reject the hypothesis if the event
$\{T^2 > t_{1-\alpha/2}^2\}$ occurs, where $t_{1-\alpha/2}$ is obtained from the t-distribution tables with m + n − 2
degrees of freedom using $F_T(t_{1-\alpha/2}) = 1 - \alpha/2$.


Example 7.3-8
(numerical example of testing $H_1 : \mu_1 = \mu_2$ versus $H_2 : \mu_1 \ne \mu_2$) We call on a Gaussian
random number generator (these are available on the Internet) and generate 15 samples from
a N(0,2) population (P1) and 15 samples from a N(2,2) population (P2). We reproduce
the numbers here:

From population P1: $S_1$ = {2.21, 0.83, 0.393, 0.975, 0.195, -0.069, -1.91, 1.44, -3.98, 0.98,
-2.84, -1.56, -0.4, -1.08, 0.116}; $\hat{\mu}_1 = -0.258$; m = 15; $\sum_{i=1}^{15} (X_{1i} - \hat{\mu}_1)^2 = 40.48$.

From population P2: $S_2$ = {-1.28, -0.258, -0.947, 5.85, 1.56, 1.48, 1.95, 3.22, 1.41, 1.84,
2.69, 3.94, 2.04, 2.08, 1.44}; $\hat{\mu}_2 = 1.801$; n = 15; $\sum_{i=1}^{15} (X_{2i} - \hat{\mu}_2)^2 = 45.66$. We insert the
data in

$$T^2 = \frac{mn(m+n)^{-1} (\hat{\mu}_1 - \hat{\mu}_2)^2}{\sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2} \, (m+n-2) \tag{7.3-14}$$

and obtain the realization for $T^2$ as 10.34. Finally, with $\alpha$ = P[reject $H_1$ | $H_1$ true] = 0.01,
we find that $F_T(t_{1-\alpha/2}) = 1 - 0.005 = 0.995$ with DOF of 15 + 15 − 2 = 28. From the tables
of the t-distribution we find that $t_{1-\alpha/2} = 2.763$ or $t_{1-\alpha/2}^2 = 7.63$. Since $T^2 > t_{1-\alpha/2}^2$, we
reject the hypothesis that the means are the same.
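
As a check on the arithmetic in Example 7.3-8, the following sketch evaluates Equation 7.3-14 from the summary statistics quoted in the example. The function name is illustrative, and the critical value 2.763 is taken from the text's t-table lookup rather than computed.

```python
def pooled_t_squared(m, n, mean1, mean2, ss1, ss2):
    """T^2 of Equation 7.3-14.

    ss1 and ss2 are the sums of squared deviations of each sample
    about its own sample mean.
    """
    return (m * n / (m + n)) * (mean1 - mean2) ** 2 / (ss1 + ss2) * (m + n - 2)


# Summary statistics quoted in Example 7.3-8.
t2 = pooled_t_squared(15, 15, -0.258, 1.801, 40.48, 45.66)

t_crit = 2.763  # t-table value from the text: 28 DOF, F_T = 0.995
reject = t2 > t_crit ** 2

print(round(t2, 2), round(t_crit ** 2, 2), reject)
```

With these inputs the statistic comes out near 10.34, above the squared critical value of about 7.63, so the sketch reaches the same decision as the example.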


Testing for the Equality of Variances for Normal Populations: 
the F-test 


Another problem we encounter is whether two Normal populations have the same variance.
The model is the following: We have two Normal populations P1, $N(\mu_1, \sigma_1^2)$, and P2,
$N(\mu_2, \sigma_2^2)$, and collect m samples (i.e., we make m i.i.d. observations) $S_1 = \{X_{1i}, i =
1, \ldots, m\}$ from P1 and n samples $S_2 = \{X_{2i}, i = 1, \ldots, n\}$ from P2. Based on these
samples we wish to test the hypothesis $H_1 : \sigma_1^2 = \sigma_2^2 \triangleq \sigma^2$ versus the alternative
$H_2 : \sigma_1^2 \ne \sigma_2^2$. The parameter space for testing $H_1$ is $\Theta_1 = \{\mu_1, \mu_2, \sigma^2\}$ while the
parameter space for $H_2$ is the global parameter space $\Theta = \{\mu_1, \mu_2, \sigma_1^2, \sigma_2^2\}$. The likelihood
function is

$$L(\Theta) = (2\pi\sigma_1^2)^{-m/2} \exp\left( -0.5 \sum_{i=1}^{m} \left( \frac{X_{1i} - \mu_1}{\sigma_1} \right)^2 \right) \times (2\pi\sigma_2^2)^{-n/2} \exp\left( -0.5 \sum_{i=1}^{n} \left( \frac{X_{2i} - \mu_2}{\sigma_2} \right)^2 \right),$$

which in $\Theta_1 = \{\mu_1, \mu_2, \sigma^2\}$ assumes the form

$$L(\Theta_1) = (2\pi\sigma^2)^{-(m+n)/2} \exp\left( -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{m} (X_{1i} - \mu_1)^2 + \sum_{i=1}^{n} (X_{2i} - \mu_2)^2 \right] \right).$$


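The parameters that maximize $L(\Theta)$ in $\Theta_1$ are, as usual, obtained by differentiating $\ln L(\Theta_1)$
with respect to $\mu_1, \mu_2, \sigma^2$ and setting the derivatives to zero to obtain

$$\mu_1^{*} = m^{-1} \sum_{i=1}^{m} X_{1i} = \hat{\mu}_1, \qquad \mu_2^{*} = n^{-1} \sum_{i=1}^{n} X_{2i} = \hat{\mu}_2,$$

$$\sigma^{2*} = (m+n)^{-1} \left( \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 \right).$$

When these results are inserted into $L(\Theta_1)$, we obtain

$$L_{LM} = \left( \frac{2\pi}{m+n} \left[ \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 \right] \right)^{-(m+n)/2} \exp(-(m+n)/2).$$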


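To maximize $L(\Theta)$ in $\Theta = \{\mu_1, \mu_2, \sigma_1^2, \sigma_2^2\}$ we differentiate $\ln L(\Theta)$ with respect to $\mu_1, \mu_2, \sigma_1^2,
\sigma_2^2$ and set the derivatives to zero to obtain

$$\mu_1^{\dagger} = m^{-1} \sum_{i=1}^{m} X_{1i} = \hat{\mu}_1, \qquad \mu_2^{\dagger} = n^{-1} \sum_{i=1}^{n} X_{2i} = \hat{\mu}_2,$$

$$\sigma_1^{2\dagger} = m^{-1} \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 = \hat{\sigma}_{1,ML}^2, \qquad \sigma_2^{2\dagger} = n^{-1} \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 = \hat{\sigma}_{2,ML}^2.$$

We note that the maximum likelihood variance estimators $\hat{\sigma}_{1,ML}^2$, $\hat{\sigma}_{2,ML}^2$ of the variances
$\sigma_1^2, \sigma_2^2$ are not unbiased. When we substitute these results into $L(\Theta)$, we obtain $L_{GM}$ as

$$L_{GM} = (2\pi)^{-(m+n)/2} \left( \frac{1}{m} \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 \right)^{-m/2} \left( \frac{1}{n} \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 \right)^{-n/2} \exp(-(m+n)/2).$$

Finally, with $\Lambda = L_{LM}/L_{GM}$ we obtain

$$\Lambda = \frac{\left( \frac{1}{m} \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 \right)^{m/2} \left( \frac{1}{n} \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 \right)^{n/2}}{\left( \frac{1}{m+n} \left[ \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2 + \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2 \right] \right)^{(m+n)/2}}. \tag{7.3-15}$$

This formidable-looking expression can be dramatically simplified by recognizing that

$$(m-1)\hat{\sigma}_1^2 = \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2$$

$$(n-1)\hat{\sigma}_2^2 = \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2$$

[Figure 7.3-6 plot, "Lambda versus $V_R$": $\Lambda$ plotted against $V_R$ over the range 0 to 7.]

Figure 7.3-6 The test statistic $\Lambda$ versus the variance ratio $V_R$ for m = n = 10.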


so that, after a little algebra, we obtain

$$\Lambda = A(m,n) \, \frac{\left( [(m-1)/(n-1)] \times \hat{\sigma}_1^2/\hat{\sigma}_2^2 \right)^{m/2}}{\left( 1 + [(m-1)/(n-1)] \times \hat{\sigma}_1^2/\hat{\sigma}_2^2 \right)^{(m+n)/2}}$$

where $A(m,n) \triangleq (m+n)^{(m+n)/2} m^{-m/2} n^{-n/2}$. It is natural to call $V_R \triangleq \hat{\sigma}_1^2/\hat{\sigma}_2^2$ the (estimator)
variance ratio, where

$$V_R = \frac{(n-1) \sum_{i=1}^{m} (X_{1i} - \hat{\mu}_1)^2}{(m-1) \sum_{i=1}^{n} (X_{2i} - \hat{\mu}_2)^2}. \tag{7.3-16}$$

Then, in terms of $V_R$,

$$\Lambda = A(m,n) \, \frac{\left( [(m-1)/(n-1)] \times V_R \right)^{m/2}}{\left( 1 + [(m-1)/(n-1)] \times V_R \right)^{(m+n)/2}} \triangleq \Lambda(V_R). \tag{7.3-17}$$

When $H_1$ is true, $V_R = F_{m-1,n-1}$, where $F_{m-1,n-1}$ is the random variable with the
F-distribution with m − 1 and n − 1 degrees of freedom, respectively. The variation of
$\Lambda$ with $V_R$ is shown in Figure 7.3-6 for m = n = 10. It should be clear from the figure that
rejection of the hypothesis, that is, the event $\{0 < \Lambda(V_R) < c\}$, is equivalent to the two-tailed
event $\{0 < V_R < t_l\} \cup \{t_u < V_R < \infty\}$. Hence, given a significance level $\alpha$, we can solve for
$t_l$ and $t_u$ from $P[0 < V_R < t_l] + P[t_u < V_R < \infty] = \alpha$, using $\Lambda(t_l) = \Lambda(t_u)$. But for simplicity
and without much loss of accuracy, we choose $P[0 < V_R < t_l'] = P[t_u' < V_R < \infty] = \alpha/2$,
the numbers $t_l'$ and $t_u'$ being easier to determine than the numbers $t_l$ and $t_u$. See Figure
7.3-7. Indeed, with $F_F(x_\beta; m-1; n-1)$ denoting the CDF of the RV $F_{m-1,n-1}$ evaluated
at the $\beta$ percentile point, that is, $F_F(x_\beta; m-1; n-1) \triangleq \beta$, we observe that $t_l' = x_{\alpha/2}$ and
$t_u' = x_{1-\alpha/2}$.

The hypothesis $H_1$ is rejected when the test yields the event $\{0 < \Lambda < c\}$ or, equivalently,
when $\{0 < V_R < x_{\alpha/2}\}$ or $\{x_{1-\alpha/2} < V_R\}$.

Figure 7.3-7 The event $\{0 < \Lambda < c\}$ is equivalent to the event $\{0 < V_R < t_l\} \cup \{t_u < V_R < \infty\}$.
The numbers $t_l$ and $t_u$ are replaced by numbers $t_l'$ and $t_u'$ that make the error in both tails $\alpha/2$.
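
To see why small values of $\Lambda$ correspond to a two-tailed event on the variance ratio, Equation 7.3-17 can be evaluated directly. The sketch below is illustrative only; the function name `glr_variance_ratio` is ours, and m = n = 10 matches the setting of Figure 7.3-6.

```python
def glr_variance_ratio(v, m, n):
    """Lambda(V_R) of Equation 7.3-17 for variance ratio v and sample sizes m, n."""
    a = (m + n) ** ((m + n) / 2) * m ** (-m / 2) * n ** (-n / 2)  # A(m, n)
    r = (m - 1) / (n - 1) * v
    return a * r ** (m / 2) / (1 + r) ** ((m + n) / 2)


m = n = 10
for v in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(v, round(glr_variance_ratio(v, m, n), 4))
```

For equal sample sizes the curve peaks at a variance ratio of 1, where $\Lambda = 1$, and falls off on both sides, so a threshold on $\Lambda$ cuts off both a lower and an upper tail of $V_R$.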





Example 7.3-9 
(numerical example of testing $H_1 : \sigma_1^2 = \sigma_2^2 \triangleq \sigma^2$ versus $H_2 : \sigma_1^2 \ne \sigma_2^2$) We test the
hypothesis that the variances of two populations are the same.

We call the RANDOM.ORG routine available on the Internet and create two sets of
Gaussian pseudo-random numbers as shown in the rows below:

N(0,1): 0.436, -1.06, -1.11, 0.46, 0.491, -1.05, 0.502, 0.598, 1.61,
-0.981, -0.021, 0.253, -1.24, 0.059, 2.12;
N(0,4): 0.634, 0.0818, -1.32, 2.96, 3.11, 3.13, 2.62, -1.96, 0.85,
6.51, -3.39, 4.25, -1.08, 3.42, 2.72;

From the top two rows, that is, the N(0,1) data, we compute $\hat{\mu}_1 = 0.074$, $\hat{\sigma}_1 =
1.01$, $\hat{\sigma}_1^2 = 1.02$; from the bottom two rows, that is, the N(0,4) data, we compute $\hat{\mu}_2 =
0.54$, $\hat{\sigma}_2 = 3.04$, $\hat{\sigma}_2^2 = 9.25$. We compute the variance ratio as

$$V_R' = \frac{(15-1) \sum_{i=1}^{15} (X_{2i} - 0.54)^2}{(15-1) \sum_{i=1}^{15} (X_{1i} - 0.074)^2} = \frac{9.25}{1.02} = 9.06.$$

At the level of $\alpha = 0.05$, and using the "equal-area" system for distributing the error
probability, we seek the percentile points $x_{0.025}$ and $x_{0.975}$ such that $F_F(x_{0.025}; 14; 14) =
0.025$ and $F_F(x_{0.975}; 14; 14) = 0.975$. As an alternative to using F-distribution tables, we
call the Stat Trek Online Statistical Table for the F-distribution calculator, and enter the
degrees of freedom (14 in both cases) and the CDF value of 0.025 to obtain $x_{0.025} = 0.34$. We
repeat with the CDF value of 0.975 and obtain $x_{0.975} = 2.98$. Thus, the acceptance region
is the interval (event) (0.34, 2.98) and the critical region is the event $\{(0, 0.34) \cup (2.98, \infty)\}$.
The test statistic yields 9.06, an event deep in the rejection region and therefore associated
with the rejection of the hypothesis that the two variances are the same. Therefore we
conclude, quite rightly, that the data come from different populations.
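
The decision step of Example 7.3-9 is a simple two-tailed comparison. The sketch below uses the sample variances and the percentile points quoted in the example (0.34 and 2.98 come from the text's F-distribution lookup, not from this code); the helper name is hypothetical.

```python
def in_two_tailed_critical_region(stat, lower, upper):
    """True when stat falls in (0, lower) or (upper, infinity)."""
    return stat < lower or stat > upper


# Sample variances quoted in Example 7.3-9.
var_large, var_small = 9.25, 1.02
v_ratio = var_large / var_small  # about 9.07; the text reports 9.06

# Percentile points for 14 and 14 DOF at alpha = 0.05, as quoted in the text.
x_low, x_high = 0.34, 2.98

print(round(v_ratio, 2), in_two_tailed_critical_region(v_ratio, x_low, x_high))
```

A ratio near 1 would land inside (0.34, 2.98) and lead to acceptance; the observed ratio is far above the upper point.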





More on the so-called F-test can be found in [7-5] to [7-9] and online by a Google search on
the entry "F-test."

Testing Whether the Variance of a Normal Population Has a
Predetermined Value


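In this situation we consider a Normal population and test whether the variance of this
population has a predetermined value. We proceed as follows: We take m samples from
a Normal population X, that is, make m i.i.d. observations on X that we label $\{X_i, i =
1, \ldots, m\}$. Under $H_1$ we assume that the variance of the population is the predetermined
$\sigma_0^2$. The alternative hypothesis $H_2$ is that the variance of the population is not equal to $\sigma_0^2$
or, more precisely, that there is not enough evidence to support the validity of $H_1$. As usual
we begin with the likelihood function and maximize it, respectively, in $\Theta_1 = \{\mu, \sigma_0^2\}$ and
$\Theta = \{\mu, \sigma^2\}$. Thus,

$$L(\Theta_1) = (2\pi\sigma_0^2)^{-m/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{m} \left( \frac{X_i - \mu}{\sigma_0} \right)^2 \right),$$

which is maximized when $\mu^{*} = \hat{\mu} \triangleq m^{-1} \sum_{i=1}^{m} X_i$. Thus,

$$L_{LM} = (2\pi\sigma_0^2)^{-m/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{m} \left( \frac{X_i - \hat{\mu}}{\sigma_0} \right)^2 \right).$$

Likewise,

$$L(\Theta) = (2\pi\sigma^2)^{-m/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{m} \left( \frac{X_i - \mu}{\sigma} \right)^2 \right),$$

which is maximized when $\mu^{\dagger} = \hat{\mu}$ and $\sigma^{2\dagger} = \hat{\sigma}^2 \triangleq m^{-1} \sum_{i=1}^{m} (X_i - \hat{\mu})^2$.

Hence

$$L_{GM} = (2\pi\hat{\sigma}^2)^{-m/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{m} \left( \frac{X_i - \hat{\mu}}{\hat{\sigma}} \right)^2 \right).$$

The generalized likelihood ratio is then

$$\Lambda = L_{LM}/L_{GM} = \left( m^{-1} \sum_{i=1}^{m} \left( \frac{X_i - \hat{\mu}}{\sigma_0} \right)^2 \right)^{m/2} \exp\left( -0.5 \sum_{i=1}^{m} \left( \frac{X_i - \hat{\mu}}{\sigma_0} \right)^2 + m/2 \right).$$

We note that $W \triangleq \sum_{i=1}^{m} ((X_i - \hat{\mu})/\sigma_0)^2$ is $\chi_{m-1}^2$. Then

$$\Lambda = \left( m^{-1} W \right)^{m/2} \exp\left( -0.5 (W - m) \right),$$

which is graphed as $\Lambda$ versus W in Figure 7.3-8 for DOF = 9.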

From Figure 7.3-8 we deduce that the critical event, that is, the event $\{0 < \Lambda < c\}$,
is equivalent to $\{0 < W < t_l\} \cup \{t_u < W < \infty\}$, where $\Lambda(t_l) = \Lambda(t_u)$ and $t_l < t_u$.
For simplicity, however, we might choose the "equal area" rule by which we seek numbers
$t_l' < t_u'$ such that $t_l' = x_{\alpha/2}$ and $t_u' = x_{1-\alpha/2}$, where $x_{\alpha/2}$ and $x_{1-\alpha/2}$ are the $\alpha/2$ and $1 - (\alpha/2)$
percentiles, that is, $F_{\chi^2}(x_{\alpha/2}; m-1) = \alpha/2$ and $F_{\chi^2}(x_{1-\alpha/2}; m-1) = 1 - (\alpha/2)$ and, as
usual, $\alpha$ = P[reject $H_1$ | $H_1$ true].

[Figure 7.3-8 plot, "Lambda versus Chi-square": $\Lambda$ plotted against W.]

Figure 7.3-8 The critical region for $\Lambda$, shown in heavy line along the ordinate, can be related to a
two-sided critical region on W (shown in heavy lines along the abscissa).


Example 7.3-10 
(numerical example of testing $H_1 : \sigma^2 = \sigma_0^2$ versus $H_2 : \sigma^2 \ne \sigma_0^2$) For testing purposes we
draw two sets of Normal random numbers from populations we call P1 and P2, respectively.
The P1 population is N(1,1) while the P2 population is N(1,4). We shall test both
populations for the hypothesis that $\sigma^2 = 1$. The numbers are from RANDOM.ORG, available
on the Internet:

N(1,1) [P1] -0.0644 2.91 -0.323 1.21 2.66 0.45 1.26 0.923 1.96 1.62
N(1,4) [P2] 0.705 0.685 0.718 1.03 2.52 1.96 0.417 2.69 -1.52 2.98

From the P1 data we compute W' = 10.3. At the 0.05 level of significance the critical region
is the event $\{0 < W < 2.7\} \cup \{19 < W < \infty\}$. Since W' is outside this region, we accept the
hypothesis that the variance of the P1 population is one. We repeat the experiment using
the P2 data. Here we compute W' = 16.5; this is still in the acceptance region (barely),
so we accept the hypothesis (in error) that the variance of the P2 population is one. We
repeat the experiment at the 0.2 level of significance and find that the critical region is the
event $\{0 < W < 4.17\} \cup \{14.7 < W < \infty\}$. We find that we still accept the hypothesis that
P1 has a variance of one but reject the hypothesis that P2 has a variance of one. There are
two points to be made from this example: (1) Small sample sizes can lead to errors, and any
results drawn from them should be viewed with some skepticism; (2) recalling the meaning of $\alpha$,
we see that if this parameter is chosen to be very small, the critical region becomes very
small so that rejection of the hypothesis becomes unlikely.
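
The statistics in Example 7.3-10 can be reproduced from the listed data. The sketch below computes W as the sum of squared deviations about the sample mean divided by the hypothesized variance; the function names are illustrative, and the Chi-square percentile points (9 DOF) are the ones quoted in the example rather than computed here.

```python
def variance_test_statistic(samples, sigma0_sq):
    """W = sum((X_i - sample mean)^2) / sigma0^2, Chi-square with m - 1 DOF under H1."""
    m = len(samples)
    mean = sum(samples) / m
    return sum((x - mean) ** 2 for x in samples) / sigma0_sq


def in_critical_region(w, lower, upper):
    """Two-sided critical region {0 < W < lower} U {upper < W}."""
    return w < lower or w > upper


p1 = [-0.0644, 2.91, -0.323, 1.21, 2.66, 0.45, 1.26, 0.923, 1.96, 1.62]
p2 = [0.705, 0.685, 0.718, 1.03, 2.52, 1.96, 0.417, 2.69, -1.52, 2.98]

w1 = variance_test_statistic(p1, 1.0)
w2 = variance_test_statistic(p2, 1.0)
print(round(w1, 1), round(w2, 1))

# Critical points quoted in the text for 9 DOF.
for label, lo, hi in (("alpha = 0.05", 2.7, 19.0), ("alpha = 0.2", 4.17, 14.7)):
    print(label, in_critical_region(w1, lo, hi), in_critical_region(w2, lo, hi))
```

At the 0.05 level neither statistic falls in the critical region; at the 0.2 level only the P2 statistic does, in agreement with the discussion above.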





7.4 GOODNESS OF FIT 


An important problem in statistics is to test whether a set of probabilities have a predetermined
set of specified values. For example, suppose we wish to determine whether observed
data come from a standard Normal distribution. Then from the Normal tables of the function
$F_{SN}$ we can compute probabilities of the form $p_i = F_{SN}(x_{i+1}) - F_{SN}(x_i), i = 1, \ldots, l$,
and compare these numbers with data obtained from multiple, independent observations on
an RV X. We can test other distributions in the same way, be they discrete or continuous.
The general model is that of sorting the data into $l$ "bins" and comparing for $i = 1, \ldots, l$
the estimated probability $\hat{p}_i$ with the specified probability $p_i$. Typically, if $Y_i$ denotes the
number of outcomes in n trials classified as belonging to "bin" i, then $\hat{p}_i = Y_i/n$. If all of the
$\{\hat{p}_i\}$ are close to the corresponding $\{p_i\}$, it is likely that the data come from a population
that has the predetermined probabilities. However, if two or more of the $\hat{p}_i$ are far from the
corresponding $p_i$, we cannot conclude that the tested population has the same parameters
as the assumed one. The choice of the number of "bins," say $l$, for a discrete random variable
with a finite number of outcomes (the elements of the sample space) is typically the number
of outcomes; thus for a die, $l$ would be six, and for a coin, $l$ would be two. When we deal with
continuous random variables, the "bins" become intervals $(x_i, x_{i+1}), i = 1, \ldots, l$, associated
with the $l$ outcomes of the form $\{x_i < X \le x_{i+1}, i = 1, \ldots, l\}$. Now the choice of $l$ requires
more thought. How "refined" a test do we need? A refined test, that is, one that contains
many bins, will typically need far more data than there are bins. Acquiring so much data may
be costly or unrealistic. However, if we choose to make an "unrefined" test, that is, select
a small number of bins, our test will necessarily be coarse. Alternatively, a large number of
bins with insufficient data can lead to gross errors and make our test meaningless.

Such considerations are, more properly, in the province of experimental design and data 
processing. As such they are beyond the scope of the material in this book. 

In the goodness-of-fit test, the hypothesis $H_1$ is that a set of probabilities $\{p_i, i =
1, \ldots, l\}$ satisfies $\{p_i = p_{0i}, i = 1, \ldots, l\}$. The given probabilities $\{p_{0i}, i = 1, \ldots, l\}$ characterize
a probability function such as a distribution function, the outcome probabilities of a
fair die, etc. We make n i.i.d. observations on an RV X and sort them into $l$ bins depending
on their values.

We define an RV $X_{ij}$ as

$$X_{ij} \triangleq \begin{cases} 1, & \text{if the } j\text{th observation of } X \text{ is in bin } i \\ 0, & \text{else.} \end{cases}$$

We define $P[X_{ij} = 1] \triangleq p_i$ independent of j for $i = 1, \ldots, l$ because of the i.i.d. constraint.
The RVs

$$Y_i \triangleq \sum_{j=1}^{n} X_{ij}, \quad i = 1, \ldots, l$$

denote the number of outcomes in the bin $i = 1, \ldots, l$ from n trials. Note that $\sum_{i=1}^{l} Y_i = n$
and $\sum_{i=1}^{l} p_i = 1$. The reader will recognize this as the multinomial law discussed in Section
4.8, that is,


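$$\begin{aligned}
P[Y_1 = r_1, Y_2 = r_2, \ldots, Y_l = r_l] = P(\mathbf{r}; n, \mathbf{p}) &= \frac{n!}{r_1! \, r_2! \cdots r_l!} \, p_1^{r_1} p_2^{r_2} \cdots p_l^{r_l} \\
&\approx \frac{\exp\left( -\frac{1}{2} \sum_{i=1}^{l} \frac{(r_i - np_i)^2}{np_i} \right)}{\sqrt{(2\pi n)^{l-1} p_1 p_2 \cdots p_l}}, \quad \text{when } n \gg 1.
\end{aligned} \tag{7.4-1}$$

The probability mass function for the jth trial is $P_j = \prod_{i=1}^{l} p_i^{x_{ij}}$, where $\sum_{i=1}^{l} x_{ij} = 1$, $\sum_{i=1}^{l} p_i = 1$, and $x_{ij}$ is
restricted to 0 or 1. The likelihood function associated with n repeated trials is

$$L(\mathbf{p}) \triangleq L(p_1, \ldots, p_l) = \prod_{i=1}^{l} p_i^{X_{i1}} \prod_{i=1}^{l} p_i^{X_{i2}} \cdots \prod_{i=1}^{l} p_i^{X_{in}} = \prod_{i=1}^{l} p_i^{Y_i}.$$

Under $H_1 : p_i = p_{0i}, i = 1, \ldots, l$, the local maximum of the likelihood function, $L_{LM}$, is merely

$$L(\mathbf{p}_0) = \prod_{i=1}^{l} p_{0i}^{X_{i1}} \prod_{i=1}^{l} p_{0i}^{X_{i2}} \cdots \prod_{i=1}^{l} p_{0i}^{X_{in}} = \prod_{i=1}^{l} p_{0i}^{Y_i}.$$

The global maximum of the likelihood function is obtained by differentiation
with respect to the $p_i, i = 1, \ldots, l$, while recalling that $\sum_{i=1}^{l} p_i = 1$. The result is $\hat{p}_i =
Y_i/n, i = 1, \ldots, l$. Thus, $L_{GM} = L(Y_1/n, Y_2/n, \ldots, Y_l/n) = \prod_{i=1}^{l} (Y_i/n)^{Y_i}$. Finally, recalling
that $\sum_{i=1}^{l} Y_i = n$, we find that the generalized likelihood ratio is

$$\Lambda = n^{n} \prod_{i=1}^{l} \left( \frac{p_{0i}}{Y_i} \right)^{Y_i} \tag{7.4-2}$$

and the critical region is $0 < \Lambda < \lambda_c$. To compute the critical region at a specified level
of significance, we need the distribution of $\Lambda$. However, the exact distribution of $\Lambda$ under
$H_1$ for an arbitrary value of n is difficult to obtain. It is shown elsewhere [7-1] that $-2 \ln \Lambda$
under the large sample assumption is approximately $\chi_{l-1}^2$.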
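
Taking logarithms of Equation 7.4-2 gives $-2 \ln \Lambda = 2 \sum_{i} Y_i \ln\left( Y_i/(n p_{0i}) \right)$, which is easy to evaluate numerically. The sketch below is an illustration only; the function name is ours, and the counts (61 heads and 39 tails in 100 flips) are borrowed from the coin example that follows in Example 7.4-1.

```python
from math import log


def neg2_log_glr(counts, p0):
    """-2 ln(Lambda) for the multinomial GLR of Equation 7.4-2.

    Bins with a zero count contribute nothing to the sum.
    """
    n = sum(counts)
    return 2.0 * sum(y * log(y / (n * p)) for y, p in zip(counts, p0) if y > 0)


g = neg2_log_glr([61, 39], [0.5, 0.5])
print(round(g, 2))
```

For these counts the value is close to 4.88, which can be compared against a Chi-square percentile with $l - 1 = 1$ degree of freedom in the same way as the Pearson statistic introduced next.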

We consider here another approach. From Equation 7.4-1 we see that the $Y_i, i = 1, \ldots, l$,
under the large sample assumption can be approximated by Normal RVs $N(np_i, np_i), i =
1, \ldots, l$, while the $U_i \triangleq (Y_i - np_i)/\sqrt{np_i}, i = 1, \ldots, l$, are approximately standard Normal. Now consider
the test statistic

$$V \triangleq \sum_{i=1}^{l} \left( \frac{Y_i - np_{0i}}{\sqrt{np_{0i}}} \right)^2, \tag{7.4-3}$$

which is called the Pearson test statistic, and accepting or rejecting a hypothesis based on
the size of V is called Pearson's test or the Chi-square test [7-16] to [7-20]. Pearson's test
statistic has the form of a $\chi^2$ RV with $l$ degrees of freedom but, in fact, has only $l - 1$ degrees
of freedom because $Y_l = n - \sum_{i=1}^{l-1} Y_i$ is completely specified once the $Y_1, Y_2, \ldots, Y_{l-1}$ are
specified. Now, if the $Y_i$ come from a population with probabilities $p_{0i}, i = 1, \ldots, l$, we
expect that a realization of V will be small. However, if the $Y_i$ come from a population
with probabilities $p_{1i}, i = 1, \ldots, l$, where at least two of the $p_{1i}$ are significantly different from
the corresponding $p_{0i}$, we expect realizations of V to be large. We can demonstrate this by
computing E[V] under $H_1$ and $H_2$. Under $H_1$ we compute $E[V|H_1] = l - 1$ (see Problem
7.24). However, under $H_2$ we compute $E[V|H_2]$ as

$$E[V|H_2] \approx \sum_{i=1}^{l} (p_{0i})^{-1} n (p_{1i} - p_{0i})^2 \tag{7.4-4}$$

when n is large (see Problem 7.25). Clearly $E[V|H_2]$ can become arbitrarily larger than
$l - 1$ when at least some of the $p_{1i}$ are different from $p_{0i}$. An exact computation of $E[V|H_2]$
would show that it can never be smaller than $l - 1$.

Returning to the test statistic in Equation 7.4-3, that is,

$$V = \sum_{i=1}^{l} \left( \frac{Y_i - np_{0i}}{\sqrt{np_{0i}}} \right)^2,$$

we note that under $H_1$ it is $\chi_{l-1}^2$. To find the constant c that determines the critical region
$\{V > c\}$ at significance $\alpha$, we solve $\int_{c}^{\infty} f_{\chi^2}(x; l-1)\,dx = \alpha$ or, equivalently, $1 - \alpha =
F_{\chi^2}(c; l-1)$. So we find that $c = x_{1-\alpha}$, the $1 - \alpha$ percentile point of $\chi_{l-1}^2$. Thus our criterion
becomes: accept $H_1$ if $V < x_{1-\alpha}$, else reject $H_1$.


Example 7.4-1
(fairness of a coin) We wish to test the hypothesis $H_1$ that $p_{01}$ = P[heads] = 0.5 = $p_{02}$ =
P[tails] at a level of significance $\alpha = 0.05$. We flip the coin 100 times and observe 61 heads
and 39 tails. Then from

$$V = \sum_{i=1}^{2} \left( \frac{Y_i - np_{0i}}{\sqrt{np_{0i}}} \right)^2,$$

we obtain $V' = \frac{1}{0.5 \times 100}[61 - 50]^2 + \frac{1}{0.5 \times 100}[39 - 50]^2 = 4.84$. We compute the critical
value from $0.95 = F_{\chi^2}(x_{0.95}; 1)$, which yields $x_{0.95} = 3.84$. Since V' = 4.84 > 3.84 we reject
the hypothesis that the coin is fair.


Example 7.4-2
(fairness of a die) We wish to test the hypothesis, at significance 0.05, that a six-faced die
is fair. We let $Y_i, i = 1, \ldots, 6$, denote the number of times face i shows up. We cast the
die 1000 times and observe $Y_1' = 152$, $Y_2' = 175$, $Y_3' = 165$, $Y_4' = 180$, $Y_5' = 159$, $Y_6' = 171$.
Then

$$V' = \frac{1}{167} \left[ (167 - 152)^2 + (167 - 175)^2 + (167 - 165)^2 + (167 - 180)^2 + (167 - 159)^2 + (167 - 171)^2 \right] = 3.25.$$

The degree of freedom is five so we solve $0.95 = F_{\chi^2}(x_{0.95}; 5)$. This yields $x_{0.95} = 11.1$ and
since 3.25 < 11.1 we accept the hypothesis that the die is fair.
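
Both of the preceding examples apply the same formula, so they can be checked with one small helper. This is a minimal sketch with a hypothetical function name; the expected count of 167 per face follows the rounding used in Example 7.4-2, and the critical values 3.84 and 11.1 are the table values quoted in the examples.

```python
def pearson_statistic(observed, expected):
    """Pearson statistic: sum over bins of (Y_i - n*p0_i)^2 / (n*p0_i)."""
    return sum((y - e) ** 2 / e for y, e in zip(observed, expected))


# Example 7.4-1: 100 coin flips, expected 50 heads and 50 tails.
v_coin = pearson_statistic([61, 39], [50, 50])

# Example 7.4-2: die counts, expected count per face rounded to 167 as in the text.
v_die = pearson_statistic([152, 175, 165, 180, 159, 171], [167] * 6)

print(round(v_coin, 2), v_coin > 3.84)   # compare with x_0.95 for 1 DOF
print(round(v_die, 2), v_die > 11.1)     # compare with x_0.95 for 5 DOF
```

The coin statistic exceeds its critical value while the die statistic does not, matching the decisions reached in the two examples.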


Example 7.4-3
(test of Normality) We wish to determine whether data are from a standard Normal N(0,1)
population. We let $H_1$ be the hypothesis that X is distributed as a standard Normal
N(0,1) and $H_2$ be the alternative that X is not distributed as N(0,1). We use differences
of the cumulative Normal distribution for the $\{p_{0i}\}$ as follows:

$p_{01} \triangleq F_{SN}(-2.0) = 0.023$;
$p_{02} \triangleq F_{SN}(-1.5) - F_{SN}(-2.0) = 0.044$;
$p_{03} \triangleq F_{SN}(-1.0) - F_{SN}(-1.5) = 0.092$;
$p_{04} \triangleq F_{SN}(-0.5) - F_{SN}(-1.0) = 0.145$;
$p_{05} \triangleq F_{SN}(0) - F_{SN}(-0.5) = 0.1915$;
$p_{06} \triangleq F_{SN}(0.5) - F_{SN}(0) = 0.1915$;
$p_{07} \triangleq F_{SN}(1.0) - F_{SN}(0.5) = 0.15$;
$p_{08} \triangleq F_{SN}(1.5) - F_{SN}(1.0) = 0.092$;
$p_{09} \triangleq F_{SN}(2.0) - F_{SN}(1.5) = 0.044$;
$p_{0,10} \triangleq F_{SN}(\infty) - F_{SN}(2) = 0.023$.

In 1000 observations we observe the following realizations:

in the interval $(-\infty, -2]$: $Y_1' = 19$
in the interval $(-2, -1.5]$: $Y_2' = 42$
in the interval $(-1.5, -1]$: $Y_3' = 96$
in the interval $(-1, -0.5]$: $Y_4' = 135$
in the interval $(-0.5, 0]$: $Y_5' = 202$
in the interval $(0, 0.5]$: $Y_6' = 193$
in the interval $(0.5, 1]$: $Y_7' = 155$
in the interval $(1, 1.5]$: $Y_8' = 72$
in the interval $(1.5, 2]$: $Y_9' = 53$
in the interval $(2, \infty)$: $Y_{10}' = 33$

We use $V = \sum_{i=1}^{10} \left( \frac{Y_i - 1000p_{0i}}{\sqrt{1000p_{0i}}} \right)^2$ as the test statistic and observe that V is $\chi_9^2$ if $H_1$ is
true. From the given data we compute V' = 12.9. Since $x_{0.95} = 16.92$ and 12.9 is less than
16.92 we accept the hypothesis that the data are Normally distributed.
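
The value of the statistic in Example 7.4-3 can be verified with a short script. The sketch below uses the bin probabilities exactly as listed in the example (including 0.145 for the fourth bin, although a direct evaluation of the standard Normal CDF gives about 0.150 for that interval) together with the observed counts; the variable names are illustrative.

```python
# Bin probabilities p0_i as listed in Example 7.4-3.
p0 = [0.023, 0.044, 0.092, 0.145, 0.1915, 0.1915, 0.15, 0.092, 0.044, 0.023]

# Observed counts in 1000 observations, in the same bin order.
observed = [19, 42, 96, 135, 202, 193, 155, 72, 53, 33]

n = sum(observed)
v_normal = sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, p0))

x_95 = 16.92  # Chi-square percentile for 9 DOF, as quoted in the text
print(n, round(v_normal, 1), v_normal < x_95)
```

The computed statistic is about 12.9, below 16.92, so the hypothesis of Normality is accepted, as in the example. The largest contributions come from the three right-most bins.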








We can use Pearson's test statistic to test whether two unknown probabilities are equal
even if no other prior information such as means and variances is available. For example,
we test two brands of printing paper in printing presses: Brand A clogs the presses six
times in 150 trials while brand B clogs the presses 25 times in 550 trials. Are brands A
and B equally likely to clog the presses? Two speech recognition programs are available
for purchase. Assuming the same speaker, we find that speech recognition program SR1
mistakes 61 words out of 250 while SR2 mistakes 30 words out of 110. Are both programs
equally effective? In the framework of probability theory we model this as follows: We
consider the occurrences of two events, say $E_1$ and $E_2$, and we ask, Is $P[E_1] = P[E_2]$? Define
$Z_1$ as the number of times we observe the occurrence of $E_1$ in m trials and $Z_2$ as the
number of times we observe the occurrence of $E_2$ in n subsequent trials. Let $p_1 \triangleq P[E_1]$ and
$p_2 \triangleq P[E_2]$, with $q_i \triangleq 1 - p_i$. Let $m \gg 1$, $n \gg 1$; then by the Central Limit Theorem $Z_1 : N(mp_1, mp_1q_1)$
and $Z_2 : N(np_2, np_2q_2)$. We define the normalized RVs $Y_1 \triangleq Z_1/m : N(p_1, p_1q_1/m)$ and
$Y_2 \triangleq Z_2/n : N(p_2, p_2q_2/n)$ and consider the RV $Y \triangleq Y_1 - Y_2$. Since $Y_1$ and $Y_2$ are independent
(recall that $Y_1$ results from observations in the first m trials while $Y_2$ results from observations
in the next n trials), it follows that Y is Normal with mean $p_1 - p_2$ and variance $\sigma_Y^2 =
(np_1q_1 + mp_2q_2)/mn$. Let $H_1$ be the hypothesis that $p_1 = p_2$ and the alternative $H_2$ be that
$p_1 \ne p_2$; clearly under $H_1$, $Y : N(0, p_1q_1(m+n)/nm)$. The Pearson test statistic adapted to
this problem is

$$V \triangleq \left( \frac{Y - (p_1 - p_2)}{\sigma_Y} \right)^2,$$

which is seen to be $\chi_1^2$. For a test of significance $\alpha$ we find the percentile $x_{1-\alpha}$ in
$1 - \alpha = F_{\chi^2}(x_{1-\alpha}; 1)$ such that if $V < x_{1-\alpha}$ we accept the hypothesis; else we reject the
hypothesis.

The difficulty with this problem is that $\sigma_Y$ is unknown since $p_1$ and $p_2$ are unknown. One
way out of this difficulty is to replace $\sigma_Y$ by an estimate $\hat{\sigma}_Y$ based on our observations.
Under $H_1$, $p_1 = p_2 \triangleq p$, and the minimum variance, unbiased estimator of p is $\hat{p} = (Z_1 +
Z_2)/(m+n)$. It follows that under $H_1$, $\hat{\sigma}_Y = \sqrt{\hat{p}\hat{q}(m+n)/mn}$, where $\hat{q} \triangleq 1 - \hat{p}$. We illustrate
with two examples.


Example 7.4-4 
(voting patterns in different regions) In the Governor's race in a large state, exit polls
showed that in a rural upstate county 167 out of 211 voters voted for the Republican
while in a downstate county that includes a large metropolitan area, 216 out of 499 voters
voted Republican. Can we assume that the probability, $p_1$, that an upstate voter will vote
Republican is the same as $p_2$, that of a downstate voter?





Solution Under Hı, pı = po 4 p, while under H2,p,; Æ po. Under Hı, we compute 
p = 388/710 = 0.54, @ = 0.46, oy = 0.041, Y/ = 167/211 = 0.79, YJ = 216/499 = 0.43, 
and Y’ ê Y! — Y} =0.36; hence V’ = (0.36/0.041)? ~ 77. At a significance level of a = 0.05, 
we find that 29.95 = 3.84. Since 77 > 3.84, the hypothesis is strongly rejected. 
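The arithmetic of this example can be checked with a short script. The following is a minimal sketch using only the Python standard library; the function names are our own choices, and the $\chi_1^2$ percentile is obtained from the standard Normal percentile by using the fact that a $\chi_1^2$ RV is the square of a standard Normal RV.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_statistic(z1, m, z2, n):
    """Return (V, sigma_hat) for the hypothesis p1 == p2 using the pooled estimate of p."""
    p_hat = (z1 + z2) / (m + n)            # pooled estimator of the common p
    q_hat = 1.0 - p_hat
    sigma_hat = sqrt(p_hat * q_hat * (m + n) / (m * n))
    y = z1 / m - z2 / n                    # observed difference Y1' - Y2'
    return (y / sigma_hat) ** 2, sigma_hat


def chi2_one_dof_percentile(u):
    """u-percentile of chi-square with one degree of freedom, via V = Z**2."""
    return NormalDist().inv_cdf((1.0 + u) / 2.0) ** 2


# Exit-poll counts of Example 7.4-4.
v, sigma_hat = two_proportion_statistic(167, 211, 216, 499)
threshold = chi2_one_dof_percentile(0.95)
print(f"V = {v:.1f}, sigma_hat = {sigma_hat:.3f}, threshold = {threshold:.2f}")
print("reject the hypothesis" if v >= threshold else "accept the hypothesis")
```

The same function applies unchanged to any pair of counts, provided $m$ and $n$ are large enough for the Normal approximation to hold.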


Example 7.4-5 
(interpretation of scientific data) In an attempt to find out whether Rhesus monkeys can 
be made to distinguish and possibly attach meaning to different sounds, including spoken 
language, the following experiment was performed. A Rhesus monkey was put in an anechoic 
(external-soundproof) chamber with a computer-controlled directional loudspeaker that 
randomly emitted bursts of one of two signals: S1, a sound of the type that the Rhesus 
monkey might hear in its natural habitat; and S2, a sound characteristic of a spoken word. 
If the monkey, upon hearing a sound burst, turned its head toward the loudspeaker, it was 
taken to mean that the monkey was reacting to the sound. If the sound was of an S2 type, 
it could mean that the monkey was curious or interested in the sound and could possibly 
be trained to accept the sound as a word. However, if the monkey showed no reaction to 
the sound, it was taken to mean that the monkey attached no significance to it. From the 
researcher’s point of view the ideal case would be if the monkey never turned its head when 
exposed to an S1 sound and always turned its head when exposed to an S2 sound. Then the 
researcher could write a scholarly paper on the cognitive abilities of the Rhesus monkey and 
become famous.† We shall ignore the perplexing problem of deciding whether the monkey’s
head has rotated enough to be scored as a “turned head.”‡

In 267 bursts of “natural habitat”-type sounds, the monkey turned its head 112 times;
in 289 bursts of spoken word sounds, the Rhesus monkey turned its head 173 times. Let $p_1$
denote the probability that a monkey will turn its head upon hearing a “natural habitat”
sound and $p_2$ denote the probability that the monkey will turn its head upon spoken-word
sounds. Under $H_1$, $p_1 = p_2 \triangleq p$, while under $H_2$, $p_1 \neq p_2$. Can we accept the hypothesis
that the monkey shows no differentiation in its reaction to the two sounds, that is, that
$H_1$: $p_1 = p_2 \triangleq p$ is true?

† This research is being done at a major university but the results have generated controversy in the
scientific community.

‡ A problem similar to the “checked swing” problem in baseball, where the umpire must decide whether
a batter “followed through” or “checked his swing.”


Solution Under $H_1$, $\hat{p}' = 0.51$, $\hat{q}' = 0.49$, $\hat{\sigma}_Y' = 0.0424$, $Y_1' = 112/267 = 0.42$,
$Y_2' = 173/289 = 0.60$, and $Y' \triangleq Y_1' - Y_2' = -0.18$; hence $V' = (-0.18/0.0424)^2 \approx 18$; at the 0.05
level of significance $x_{0.95} = 3.84$. Hence the hypothesis is strongly rejected.


7.5 ORDERING, PERCENTILES, AND RANK 


For the reader’s convenience we repeat here some of the material from Section 6.8 of
Chapter 6. We make $n$ i.i.d. observations on a generic RV $X$ (sometimes called a population)
with CDF $F_X(x)$ to obtain the sample $X_1, X_2, \ldots, X_n$. The joint pdf of the sample is
$f_X(x_1) \times \cdots \times f_X(x_n)$, $-\infty < x_i < \infty$, $i = 1, \ldots, n$. Next we order the $X_i$, $i = 1, \ldots, n$, by size
(signed magnitude) to obtain the ordered sample $Y_1, Y_2, \ldots, Y_n$ such that
$-\infty < Y_1 < Y_2 < \cdots < Y_n < \infty$. This is sometimes called the order statistics of the observations on $X$. When
ordered, the sequence 3, -2, -9, 4 would become -9, -2, 3, 4. If a sequence $X_1, \ldots, X_{20}$ was
generated from $n = 20$ observations on $X : N(0,1)$, it would be very unlikely that $Y_1 > 0$ because
this would require that the other 19 $Y_i$, $i = 2, \ldots, 20$, be greater than zero and therefore all
the samples would be on the positive side of the Normal curve. The probability of this event
is $(1/2)^{20}$. Likewise it would be extremely unlikely that $Y_{20} < 0$ because this would require
that the other 19 $Y_i$, $i = 1, \ldots, 19$, be less than zero. As shown in Section 5.3, the joint pdf of
the ordered sample $Y_1, Y_2, \ldots, Y_n$ is $n!\, f_X(y_1) \times \cdots \times f_X(y_n)$, $-\infty < y_1 < y_2 < \cdots < y_n < \infty$,
and zero else. Ordering and ranking are not the same in that ranking normally assigns a
value to the ordered elements. For example most people would order the pain of a broken
bone higher than that of a sore throat due to a cold. But if a physician asked the patient
to rank these pains on a scale of 0 to 10, the pain associated with the broken bone might
be ranked at 8 or 9 while the sore throat might be given a rank of 3 or 4.

Consider next the idea of percentiles. We have used the notion of percentiles in other 
places in the book; here we briefly discuss it in greater detail. Assume that the IQ of a large 
segment of a select population is distributed as N(100,100), that is, a mean of 100 and a 
standard deviation of 10. Obviously the Normal approximation is valid only over a limited 
range because no one has an IQ of 1000 or an IQ of —10. The IQ test itself is valid only over 
a limited range and may not give an accurate score for people that are extremely bright 
or severely cognitively handicapped. It is sometimes said that people in either group are 
“off the IQ scale.” Still the IQ test is widely used as an indicator of problem-solving ability. 
Suppose that the result of an IQ test says that the child ranks in the 93rd percentile of the 
examinees and therefore qualifies for admission to programs for the “gifted.” How do we 
locate the 93rd percentile on the IQ scale? 


Definition (percentile): Given an RV $X$ with CDF $F_X(x)$, the u-percentile of $X$
is the number $x_u$ such that $F_X(x_u) = u$. If the CDF $F_X$ is everywhere continuous with
continuous derivative, then $x_u = F_X^{-1}(u)$, where the function $F_X^{-1}$ is the inverse function
associated with the CDF $F_X$, that is, $F_X^{-1}(F_X(x_u)) = x_u$. The standard Normal CDF and
its inverse are shown in Figure 7.5-1.


Figure 7.5-1 (a) The standard Normal CDF, $u = F_X(x_u)$; (b) the inverse function, $x_u = F_X^{-1}(u)$.


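Observation In the special case of the standard Normal, where $Z : N(0,1)$, we use the
symbol $z_u$ to denote the u-percentile of $Z$. If $X : N(\mu, \sigma^2)$, then the u-percentile of $X$, $x_u$,
is related to $z_u$ according to

$$x_u = \mu + z_u\sigma. \tag{7.5-1}$$


Example 7.5-1
(relation between $x_u$ and $z_u$) Show that $x_u = \mu + z_u\sigma$.


Solution We write

$$F_X(x_u) = u = (2\pi\sigma^2)^{-1/2}\int_{-\infty}^{x_u}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)dx$$
$$= (2\pi)^{-1/2}\int_{-\infty}^{(x_u-\mu)/\sigma}\exp\left(-\frac{1}{2}z^2\right)dz,$$

where the second line follows from the change of variable $z = (x - \mu)/\sigma$.
The last line is the CDF of $Z : N(0,1)$ evaluated at $(x_u - \mu)/\sigma$, so that $(x_u - \mu)/\sigma = z_u$.
Hence $x_u = \mu + z_u\sigma$. We can use this result in the
previously mentioned IQ problem. From the data we have $F_X(x_u) = 0.93 = F_Z(z_u)$. From
the table of $F_{SN}$, we get that $z_u \approx 1.48$. Then with $x_u = \mu + z_u\sigma = 100 + 1.48(10)$, we
get that the 93rd percentile in the IQ distribution corresponds to an IQ of 115.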
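In place of a table lookup, the percentile can be computed numerically. The sketch below assumes only the Python standard library, whose `statistics.NormalDist` class provides the inverse of the standard Normal CDF; the helper name `normal_percentile` is ours.

```python
from statistics import NormalDist


def normal_percentile(u, mu, sigma):
    """Return (x_u, z_u) for X : N(mu, sigma**2), using x_u = mu + z_u * sigma."""
    z_u = NormalDist().inv_cdf(u)          # u-percentile of the standard Normal
    return mu + z_u * sigma, z_u


# IQ example: mean 100, standard deviation 10, 93rd percentile.
x_u, z_u = normal_percentile(0.93, 100.0, 10.0)
print(f"z_u = {z_u:.2f}, x_u = {x_u:.1f}")
```

Rounded to the nearest integer, the computed $x_u$ agrees with the IQ of 115 obtained from the table.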
How Ordering is Useful in Estimating Percentiles and the Median*


We briefly review here some of the material of Section 6.8 that is associated with percentiles 
and the median. 

The median of the population $X$ is the point $x_{0.5}$ such that $F_X(x_{0.5}) = 0.5$. This is to
be contrasted with the mean of $X$, written as $\mu_X$, and defined as $\mu_X = \int_{-\infty}^{\infty} x f_X(x)\,dx$. The
median and mean do not necessarily coincide. For example, in the case of $f_X(x) = \lambda e^{-\lambda x}u(x)$
we find that $\mu_X = 1/\lambda$ but $x_{0.5} = 0.69/\lambda$. To compute the mean of $X$ we need $f_X(x)$, which
is often not known. The mean may seem like a rather abstract parameter while the median
is merely the point that divides the population in half, that is, half the population is at
or below the median and half above.† The situation where $f_X(x)$ is assumed to exist and
for which we can extract or estimate parameters is called the parametric case. Typically, in
the parametric case, we might assume a form for the population density, for example, the
Normal, and wish to estimate some unknown parameter of the distribution, for example,
the mean $\mu_X$. Then given $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$ on $X$, we estimate $\mu_X$ with
$\hat{\mu}_X = n^{-1}\sum_{i=1}^{n} X_i$, which happens to be an unbiased and consistent estimator for the mean
of many populations. Indeed it is the simple form of the mean estimator function $\hat{\mu}_X$ and
the fact that if $\sigma_X^2$ is finite then $\hat{\mu}_X \to \mu_X$ for large $n$ (see the law of large numbers) that
make the mean so useful in many applications. The estimation of parameters in known
or assumed distributions and other operations, for example, hypothesis testing involving
known or assumed distributions, is known as parametric statistics.

The estimation of the properties and parameters of a population without any assumptions
on the form or knowledge of the population distribution is known as distribution-free,
robust, or nonparametric statistics. Statistics based on observations only without assuming
underlying distributions are robust in the sense that the theorems and conclusions drawn
from the observations do not change with the form of the underlying distributions. Whereas
the mean and standard deviation are useful in characterizing the center and dispersion of a
population in the parametric case, the median and range play this role in the nonparametric
case. To estimate the median from $X_1, X_2, \ldots, X_n$, we use the order statistics and estimate
$x_{0.5}$ with the sample median estimator

$$\hat{Y}_{0.5} = \begin{cases} Y_{k+1}, & \text{if } n \text{ is odd, that is, } n = 2k + 1, \\ 0.5(Y_k + Y_{k+1}), & \text{if } n \text{ is even, that is, } n = 2k. \end{cases} \tag{7.5-2}$$

The sample median is not an unbiased estimator for $x_{0.5}$ but becomes nearly so when $n$ is
large. The dispersion in the nonparametric case is measured from the 50 percent percentile
range, that is, $\Delta x_{0.50} \triangleq x_{0.75} - x_{0.25}$, or the 90 percent percentile range, that is,
$\Delta x_{0.90} \triangleq x_{0.95} - x_{0.05}$, or some other appropriate range.
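The sample median estimator is easily coded. The sketch below is one possible implementation in Python; the two test sequences are made up for illustration, and the comparison with `statistics.median` from the standard library is included only as a cross-check.

```python
import statistics


def sample_median(samples):
    """Sample median from the order statistics Y_1 <= ... <= Y_n."""
    y = sorted(samples)                    # order statistics
    n = len(y)
    k = n // 2
    if n % 2 == 1:                         # n = 2k + 1: the middle element Y_{k+1}
        return y[k]                        # zero-based index k holds Y_{k+1}
    return 0.5 * (y[k - 1] + y[k])         # n = 2k: average of Y_k and Y_{k+1}


odd_sample = [3, -2, -9, 4, 1]             # hypothetical data, n = 5
even_sample = [3, -2, -9, 4]               # hypothetical data, n = 4
print(sample_median(odd_sample), sample_median(even_sample))
print(statistics.median(odd_sample), statistics.median(even_sample))
```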


* Readers familiar with the contents of Section 6.8 can skip this subsection.

† Thus it is not wholly accurate to say that “half the population is below and half above” the median.
Moreover the reader should be aware that the median of a sample is typically not the same as the median
of the whole population.


Figure 7.5-2 Estimated percentile range from ten ordered samples $y_1, y_2, \ldots, y_{10}$ showing linear
interpolation between the samples. To get the estimated percentile take the ordinate value and multiply
by 100/11. Thus, to a first approximation, the 90th percentile is estimated from $y_{10}$ while the 9th
percentile is estimated from $y_1$. An approximate 50 percent range is covered by $y_8 - y_3$.


Example 7.5-2
(interpolation to get percentile points) Using the symbol $\alpha \sim \beta$ to mean $\alpha$ estimates $\beta$, we
have $Y_3 \sim x_{0.273}$, $Y_4 \sim x_{0.364}$, and using linear interpolation, we get $x_{0.3}$ as

$$Y_4 + \frac{(Y_4 - Y_3)(0.3 - 4/11)}{1/11} \sim x_{0.3}.$$

Linear interpolation between ordered samples is illustrated in Figure 7.5-2.
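The interpolation rule can be written for a general percentile. The following is a sketch under the assumption, taken from Figure 7.5-2, that the kth ordered sample estimates the percentile $k/(n+1)$; the function name and the ten sample values are hypothetical.

```python
def estimate_percentile(samples, u):
    """Estimate x_u by linear interpolation, treating Y_k as an estimate of x_{k/(n+1)}."""
    y = sorted(samples)
    n = len(y)
    position = u * (n + 1)                 # fractional rank of the u-percentile
    k = int(position)                      # rank (1-based) of the ordered sample just below
    if k < 1 or k >= n:
        raise ValueError("u lies outside the range covered by the ordered samples")
    fraction = position - k
    return y[k - 1] + fraction * (y[k] - y[k - 1])


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]   # hypothetical, n = 10
y3, y4 = samples[2], samples[3]
from_formula = y4 + (y4 - y3) * (0.3 - 4 / 11) / (1 / 11)       # expression of Example 7.5-2
print(estimate_percentile(samples, 0.3), from_formula)
```

Both expressions describe the same straight line through the points $(3/11, Y_3)$ and $(4/11, Y_4)$, so they agree up to rounding.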





We discuss next a fundamental result connecting order statistics with percentiles. Once
again the model is that of collecting a sample of $n$ i.i.d. observations $X_1, X_2, \ldots, X_n$ on an
RV $X$ with CDF $F_X(x)$. We recall the notation $P[X_i \le x_u] \triangleq u$. Next we consider the
order statistics $Y_1 < Y_2 < \cdots < Y_n$. Now consider the event $\{Y_k < x_u\}$. Since $Y_k$ is the kth
element in the ordering of the $\{X_i\}$, there are at least $k$ of the $\{X_i\}$ that are less than $x_u$.
There may be more but certainly not fewer. Then, because the $\{X_i\}$ are i.i.d. we can use the
binomial probability formula to compute

$$P[Y_k < x_u] = P[\text{at least } k \text{ of the } \{X_i\} \text{ are less than } x_u] = \sum_{i=k}^{n}\binom{n}{i}u^i(1-u)^{n-i}. \tag{7.5-3}$$

Next consider the event $\{Y_{k+r} > x_u\}$. Since $Y_{k+r}$ is the $(k+r)$th element in the ordering of
the $\{X_i\}$, there are at least $n - (k+r) + 1$ of the $\{X_i\}$ that are greater than $x_u$. Equivalently,
there can be no more than $k + r - 1$ of the $\{X_i\}$ less than $x_u$. Then

$$P[Y_{k+r} > x_u] = P[\text{no more than } k + r - 1 \text{ of the } \{X_i\} \text{ are less than } x_u] = \sum_{i=0}^{k+r-1}\binom{n}{i}u^i(1-u)^{n-i}. \tag{7.5-4}$$

The intersection of the events $\{Y_{k+r} > x_u\}$ and $\{Y_k < x_u\}$ is the event $\{Y_k < x_u < Y_{k+r}\}$. Its
probability is

$$P[Y_k < x_u < Y_{k+r}] = \sum_{i=k}^{k+r-1}\binom{n}{i}u^i(1-u)^{n-i} \tag{7.5-5}$$

and is independent of $f_X(x)$. The result given in Equation 7.5-5 is one of the major results
of nonparametric statistics and has important applications, for example, estimating the
median of a population, as we illustrate below.
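That the coverage probability does not depend on the population density can be checked by simulation. The sketch below evaluates the binomial sum of Equation 7.5-5 and compares it with Monte Carlo estimates for two different populations; the choice of populations (exponential and uniform), the values $n = 10$, $k = 3$, $r = 4$, the number of trials, and the seed are all arbitrary assumptions made for illustration.

```python
import random
from math import comb, log


def coverage_probability(n, k, r, u):
    """P[Y_k < x_u < Y_{k+r}] as the binomial sum of Equation 7.5-5."""
    return sum(comb(n, i) * u**i * (1 - u) ** (n - i) for i in range(k, k + r))


def simulated_coverage(n, k, r, draw, x_u, trials=20000, seed=1):
    """Fraction of trials in which the interval (Y_k, Y_{k+r}) contains x_u."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = sorted(draw(rng) for _ in range(n))
        if y[k - 1] < x_u < y[k + r - 1]:  # zero-based indices of Y_k and Y_{k+r}
            hits += 1
    return hits / trials


n, k, r, u = 10, 3, 4, 0.5
exact = coverage_probability(n, k, r, u)
# Exponential population with rate 1 has median ln 2; uniform (0, 1) has median 0.5.
sim_exponential = simulated_coverage(n, k, r, lambda g: g.expovariate(1.0), log(2.0))
sim_uniform = simulated_coverage(n, k, r, lambda g: g.random(), 0.5)
print(f"exact {exact:.4f}, exponential {sim_exponential:.4f}, uniform {sim_uniform:.4f}")
```

Up to sampling error, the two simulated values should agree with each other and with the binomial sum, even though the populations are quite different.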


Example 7.5-3
(How large a sample do we need to cover the median at 95 percent confidence?) We seek the
end points $Y_1$, $Y_n$ of a random interval $[Y_1, Y_n]$ so that the event $\{Y_1 < x_{0.5} < Y_n\}$ occurs
with probability 0.95. Here $Y_1 \triangleq \min(X_1, X_2, \ldots, X_n)$, $Y_n \triangleq \max(X_1, X_2, \ldots, X_n)$. In effect,
how large should $n$ be?


Solution We compute

$$P[Y_1 < x_{0.5} < Y_n] = \sum_{i=1}^{n-1}\binom{n}{i}(1/2)^n = 0.95$$

and find that for $n = 5$, $P[Y_1 < x_{0.5} < Y_5] \approx 0.94$, while for $n = 6$ the probability rises to
about 0.97, so that $n = 6$ is the smallest sample size that reaches 0.95. The probability that the
random interval $[Y_1, Y_n]$ covers the 50 percent percentile point is shown in Figure 7.5-3 for
various values of $n$.
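The search for the required sample size can be automated. The following sketch, with function names of our own choosing, evaluates the binomial sum for increasing $n$ until the requested confidence is reached.

```python
from math import comb


def median_coverage(n):
    """P[Y_1 < x_0.5 < Y_n], the sum over i = 1, ..., n - 1 of C(n, i) (1/2)**n."""
    return sum(comb(n, i) for i in range(1, n)) / 2**n


def smallest_sample_size(confidence):
    """Smallest n for which the interval [Y_1, Y_n] covers the median with the given probability."""
    n = 2
    while median_coverage(n) < confidence:
        n += 1
    return n


for n in range(2, 9):
    print(n, round(median_coverage(n), 4))
print("smallest n for 0.95:", smallest_sample_size(0.95))
```

Since only the two terms $i = 0$ and $i = n$ are missing from the full binomial sum, the coverage equals $1 - 2(1/2)^n$, which gives a quick check on the printed values.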

Example 7.5-4
(most probable adjacent ordered pair to cover $x_{0.33}$) We have the order statistics
$\{Y_1, Y_2, \ldots, Y_n\}$ and wish to find the pair $\{Y_i, Y_{i+1}\}$, $i = 1, \ldots, n - 1$, that maximizes the
probability of covering the 33.33rd percentile point. The 33.33rd percentile point $x_{0.33}$ is defined by
$1/3 = F_X(x_{0.33})$. For specificity we assume $n = 10$. From Equation 7.5-5 we compute

$$P[Y_k < x_{0.33} < Y_{k+1}] = \frac{10!}{k!(10 - k)!}(1/3)^k(2/3)^{10-k}, \quad k = 1, \ldots, 9,$$

and plot the result in Figure 7.5-4. Clearly the interval $[Y_3, Y_4]$ is most likely to cover $x_{0.33}$.
The probability of the event $\{Y_3 < x_{0.33} < Y_4\}$ is 0.26.
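The nine probabilities of this example are quickly tabulated. The sketch below uses `math.comb` from the Python standard library for the binomial coefficient; the variable names are ours.

```python
from math import comb

n, u = 10, 1 / 3
# Probability that the adjacent pair [Y_k, Y_{k+1}] covers the 33.33rd percentile point.
probabilities = {k: comb(n, k) * u**k * (1 - u) ** (n - k) for k in range(1, n)}
best_k = max(probabilities, key=probabilities.get)

for k, p in probabilities.items():
    print(k, round(p, 4))
print("most probable pair: k =", best_k)
```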


Figure 7.5-3 Probability that the random interval $[Y_1, Y_n]$ covers the median, that is, the probability
of the event $\{Y_1 < x_{0.5} < Y_n\}$, for various values of $n$. (Horizontal axis: sample size; vertical axis:
probability that random interval covers the median.)


Figure 7.5-4 Among the pairwise intervals $[Y_k, Y_{k+1}]$, the interval $[Y_3, Y_4]$ is most likely to cover $x_{0.33}$.
Here $n = 10$. (Horizontal axis: kth ordered pair, $k = 1, \ldots, 9$; vertical axis: probability that the 33rd
percentile point is covered by the kth adjacent ordered pair.)


Example 7.5-5 
(the median and mean are not the same for the binomial) We make the somewhat trivial 
observation that for the binomial case the mean and median do not coincide. For example 
with p = 1/2 and n = 4, the mean is 2 but the median, such as it is, is somewhere between 1 
and 2. However, when n is large the median and mean approach each other and the median 
can be estimated by the mean. Indeed it can be shown that the error between the mean
and median is proportional to $(p(1 - p))^n$, which becomes arbitrarily small for $n \to \infty$.











Confidence Interval for the Median When n Is Large 


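If $n$ is large enough so that the Normal approximation to the binomial is valid in distribution,
we can use

$$P[\alpha \le S_n \le \beta] \approx \frac{1}{\sqrt{2\pi}}\int_{\alpha_n}^{\beta_n}\exp\left[-\frac{1}{2}u^2\right]du,$$

where

$$P[\alpha \le S_n \le \beta] = \sum_{i=\alpha}^{\beta}\binom{n}{i}p^i(1 - p)^{n-i},$$

$$\alpha_n \triangleq \frac{\alpha - np - 0.5}{\sqrt{np(1 - p)}}, \quad \text{and} \quad \beta_n \triangleq \frac{\beta - np + 0.5}{\sqrt{np(1 - p)}}. \tag{7.5-6}$$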


To apply these results to the problem at hand we write

$$P[Y_r < x_{0.5} < Y_{n-r+1}] = \sum_{i=r}^{n-r}\binom{n}{i}(1/2)^n, \tag{7.5-7}$$

where we used that, by definition of the median, $u = F_X(x_{0.5}) = 1/2$. The choice of
subscripts will ensure that the confidence interval will begin at the rth place counting from
the bottom, that is, 1, 2, 3, ..., r, and end at the place reached by counting r observations
back from the top. For example if the 95 percent confidence calculation for $n = 10$ yields
$r = 3$, the confidence interval begins at the third observation and ends at the eighth
observation, both points reached by counting three places from bottom and top, respectively,
that is, 1, 2, 3 ($Y_3$) and 10, 9, 8 ($Y_8$), and the result would appear as
$P[Y_3 < x_{0.5} < Y_8] = 0.95$.

In the binomial sum in Equation 7.5-7 we note that its mean is $n/2$ and its standard
deviation is $\sqrt{n}/2$. Hence the Normal approximation to the binomial sum in Equation 7.5-7
for a 95 percent confidence interval is

$$\sum_{i=r}^{n-r}\binom{n}{i}(1/2)^n \approx \frac{1}{\sqrt{2\pi}}\int_{\alpha_n}^{\beta_n}\exp\left[-\frac{1}{2}z^2\right]dz = 0.95,$$

which, from the tables of the standard Normal distribution function $F_{SN}(z)$, yields
$\alpha_n = -1.96$, $\beta_n = 1.96$. Then it follows from Equation 7.5-6 that

$$1.96 = \frac{n - r - n/2 + 0.5}{\sqrt{n}/2}, \qquad -1.96 = \frac{r - n/2 - 0.5}{\sqrt{n}/2},$$

either of which yields $r = (n/2) - 1.96\sqrt{n}/2 + 0.5$. If $r$ is not an integer replace $r$ by $\lfloor r \rfloor$,
where $\lfloor \cdot \rfloor$ is the floor (greatest-integer) function, that is, the largest integer less than or
equal to $r$.


Example 7.5-6
(95 percent confidence interval for the median for n = 20) We make 20 observations on an
RV $X$ and label these $\{X_i, i = 1, \ldots, 20\}$. We order them by size so that $Y_1 < Y_2 < \cdots < Y_n$.
We use $r = (n/2) - 1.96\sqrt{n}/2 + 0.5$ to obtain $r = 6.12$ and $\lfloor r \rfloor = 6$. Then
$P[Y_6 < x_{0.5} < Y_{15}] > 0.95$.
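The Normal-approximation value of $r$ can be compared with the exact binomial sum of Equation 7.5-7. The following is a sketch with helper names of our own; it assumes $n$ is large enough that the computed $r$ is at least 1.

```python
from math import comb, floor, sqrt
from statistics import NormalDist


def median_interval_rank(n, confidence=0.95):
    """Rank r from the Normal approximation: floor(n/2 - z * sqrt(n)/2 + 0.5)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # about 1.96 for 95 percent
    return floor(n / 2 - z * sqrt(n) / 2 + 0.5)


def exact_coverage(n, r):
    """Equation 7.5-7: P[Y_r < x_0.5 < Y_{n-r+1}] as an exact binomial sum."""
    return sum(comb(n, i) for i in range(r, n - r + 1)) / 2**n


n = 20
r = median_interval_rank(n)
coverage = exact_coverage(n, r)
print(f"r = {r}, interval [Y_{r}, Y_{n - r + 1}], exact coverage = {coverage:.4f}")
```

For $n = 20$ the exact coverage of $[Y_6, Y_{15}]$ comes out slightly above 0.95, in agreement with the inequality stated in the example.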








Distribution-Free Hypothesis Testing: Testing If Two Populations 
Are the Same Using Runs 


In general, hypothesis testing using nonparametric statistics is more involved than in the 
parametric case because of the difficulty of computing the distribution of the test statistic. 
However, when the size of the samples is large, say greater than 10, we can use the Normal 
approximation for computing the acceptance/rejection region. 

We introduce the idea of a run by considering the following simple situation. We make
$n_1$ observations on an RV $X$ (the “population”) with CDF $F_X(x)$ and label these samples
$\{X_i^{(1)}, i = 1, \ldots, n_1\}$. After ordering them by size we create the samples $\{Y_i, i = 1, \ldots, n_1\}$.
Then we make $n_2$ observations on the same RV $X$ and label these samples
$\{X_i^{(2)}, i = 1, \ldots, n_2\}$. We order these samples by size to obtain the ordered set $\{Z_i, i = 1, \ldots, n_2\}$. Next
we combine the two unordered sets of samples into a single set and order them by size. Then
a typical ordered sequence might be $Z_1, Z_2, Y_1, Z_3, Y_2, \ldots, Z_{n_2}, Y_{n_1-1}, Y_{n_1}$, where
$Z_1 < Z_2 < Y_1 < Z_3 < Y_2 < \cdots < Z_{n_2} < Y_{n_1-1} < Y_{n_1}$. We define a run as a sequence of letters of the
same kind bounded by letters of the other kind or the beginning/end of the entire sequence.
Thus $Z_1, Z_2$ is the first run and its length is two. The next run is $Y_1$ and it has length one,
etc. The last run is $Y_{n_1-1}Y_{n_1}$ and it has length two. We count the total number of runs
and call this $D$. We note that $D$ is a random variable. Since the two sets of samples come
from the same population, we expect a thorough mixing of the $Y$’s and $Z$’s and therefore
a large $D$. Note, however, that had the $Y$’s and $Z$’s come from different populations, $D$
would, in all likelihood, be significantly reduced. For example, suppose that we have two
populations, say, $X^{(1)}$ with pdf $f_{X^{(1)}}(x) = \text{rect}(x)$ and $X^{(2)}$ with pdf $f_{X^{(2)}}(x) = \text{rect}(x - 2)$.
If $\{Y_i, i = 1, \ldots, n\}$ represent the ordered sequence from the $X^{(1)}$ population and
$\{Z_i, i = 1, \ldots, n\}$ represent the ordered sequence from the $X^{(2)}$ population, then the ordered samples
of the mixed sequence will appear as $Y_1 Y_2 \cdots Y_n Z_1 Z_2 \cdots Z_n$ and will have $D' = 2$ since the
supports of their pdf’s don’t overlap.


Example 7.5-7
(realizations of D for populations of equal and different means) We generate two sets of
ten Normal random numbers (we show only to two places) from $N(0,1)$ obtained from
RANDOM.ORG, a Normal random number provider available on the Internet.


$N(0,1) \to \{x^{(1)}\}$: $-0.19, 0.99, -1.1, -1.0, -1.3, -0.53, -0.25, 0.75, -0.25, 0.75$
$N(0,1) \to \{x^{(2)}\}$: $0.68, -1.2, 0.28, 0.61, -1.2, -1.5, 2.1, -0.10, -0.87, 0.80$.


We order by size the $x^{(1)}$ and $x^{(2)}$ sequences separately to create, respectively, the ordered
sequences $y_1 y_2 \cdots y_{10}$ and $z_1 z_2 \cdots z_{10}$, where $y_1 = -1.3$, $y_{10} = 0.99$, $z_1 = -1.5$, and
$z_{10} = 2.1$. After combining the two sequences into a single sequence and ordering all the elements
of this sequence by size, we get the sequence

$$z_1 y_1 z_2 z_3 y_2 y_3 z_4 y_4 y_5 y_6 y_7 z_5 z_6 z_7 z_8 y_8 y_9 z_9 y_{10} z_{10},$$

which yields $D' = 11$.

We now repeat the experiment and select ten random numbers from the standard
Normal distribution, that is, $N(0,1)$, and another ten from $N(1,1)$; the numbers are displayed
to two places. The result is


$N(0,1) \to \{x^{(1)}\}$: $-0.079, 1.3, -0.15, 1.2, 0.75, -1.2, -0.11, -0.84, 0.35, 0.55$
$N(1,1) \to \{x^{(2)}\}$: $1.2, 0.056, 0.3, -0.77, 0.95, 1.1, 0.095, -0.43, 1.1, 1.3$.


Here the ordered $y$ sequence is associated with the $N(0,1)$ and the ordered $z$ sequence
is associated with $N(1,1)$. After combining the two sequences into a single sequence and
ordering all the elements of this single sequence by size, we get the sequence

$$y_1 y_2 z_1 z_2 y_3 y_4 y_5 z_3 z_4 z_5 y_6 y_7 y_8 z_6 z_7 z_8 z_9 y_9 y_{10} z_{10},$$

which yields $D' = 8$, that is, 27 percent fewer runs than in the case where both samples came
from $N(0,1)$. This example suggests
that the RV $D$ can be used as a statistic for testing the hypothesis that the populations are
the same. If $D$ is large enough, say $D > d_0$, we may conclude that the two samples come
from the same population; else we reject that they come from the same population. The
choice of $d_0$ is discussed below.
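Counting runs by hand is error-prone, so a short routine is useful. The sketch below labels each observation with its sample of origin, sorts the combined list, and counts the label changes; the function name is ours, and ties between the two samples (which would make the count depend on how the tie is broken) are assumed not to occur, as is the case for the first pair of sequences in this example.

```python
def count_runs(sample_y, sample_z):
    """Number of runs D in the combined, size-ordered sequence of two non-empty samples."""
    labeled = [(value, "y") for value in sample_y] + [(value, "z") for value in sample_z]
    labeled.sort(key=lambda pair: pair[0])             # order the combined sequence by size
    labels = [label for _, label in labeled]
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes + 1                                 # each label change starts a new run


# First pair of sequences of Example 7.5-7, both drawn from N(0, 1).
x1 = [-0.19, 0.99, -1.1, -1.0, -1.3, -0.53, -0.25, 0.75, -0.25, 0.75]
x2 = [0.68, -1.2, 0.28, 0.61, -1.2, -1.5, 2.1, -0.10, -0.87, 0.80]
print(count_runs(x1, x2))

# Two hypothetical samples with non-overlapping ranges give the minimum of two runs.
print(count_runs([0.1, 0.2, 0.3], [2.1, 2.2, 2.3]))
```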


We test whether two samples come from the same population using the principles of hypothesis
testing. We have two sets of samples: $\{X_i^{(1)}, i = 1, \ldots, n_1\}$ and $\{X_i^{(2)}, i = 1, \ldots, n_2\}$.
The null hypothesis, $H_1$, is that the two samples come from the same population, while
the alternative, $H_2$, is that they do not come from the same population or, perhaps more
accurately, that there is not enough evidence that they come from the same population.
The test will be based on observing the test statistic $D$. If the event $\{D > d_0\}$ occurs,
then the two samples interweave well and we may conclude that they come from the same
population. If the event $\{D < d_0\}$ occurs, we may conclude that $H_1$ is not supported by the
data. If $\alpha \triangleq P[\text{rejecting } H_1 | H_1 \text{ true}]$ denotes the level of significance, then
$\alpha = P[D < d_0 | H_1 \text{ true}] = \sum_{\text{all } d < d_0} P_D(d; n_1, n_2)$, where $P_D(d; n_1, n_2)$ is the probability of
observing $d$ runs in interwoven sequences of lengths $n_1$ and $n_2$.
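Computing $P_D(d; n_1, n_2)$ requires some rather sophisticated counting procedures so we
give only the final result here. Define

$$C_m^n \triangleq \frac{n!}{m!(n - m)!}.$$

Under the null hypothesis we find that

$$P_D(d; n_1, n_2) = \begin{cases} 2\,C_{(d/2)-1}^{n_1-1}\,C_{(d/2)-1}^{n_2-1} \big/ C_{n_1}^{n_1+n_2}, & d \text{ even}, \\ \left(C_{(d-1)/2}^{n_1-1}\,C_{(d-3)/2}^{n_2-1} + C_{(d-3)/2}^{n_1-1}\,C_{(d-1)/2}^{n_2-1}\right) \big/ C_{n_1}^{n_1+n_2}, & d \text{ odd}. \end{cases}$$

These unwieldy formulas do not yield much for the purpose of analysis and require machine
computation to evaluate $\alpha$. However, it has been shown that for $n_1 > 10$, $n_2 > 10$, the
distribution of $D$ is well approximated by a Normal CDF with approximate mean and
variance given by, respectively,

$$\mu_D = \frac{2n_1n_2}{n_1 + n_2}, \qquad \sigma_D^2 = 4(n_1 + n_2)\left(\frac{n_1}{n_1 + n_2}\right)^2\left(\frac{n_2}{n_1 + n_2}\right)^2.$$

Hence we approximate $\alpha = P[D < d_0 | H_1 \text{ true}] = \sum_{\text{all } d < d_0} P_D(d; n_1, n_2)$ with

$$\alpha = \sum_{\text{all } d < d_0} P_D(d) \approx \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z_\alpha}\exp\left(-\frac{1}{2}z^2\right)dz, \qquad z_\alpha \triangleq \frac{d_0 - \mu_D}{\sigma_D}.$$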
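The machine computation referred to above is short. The sketch below evaluates the probability mass function of the number of runs, in the standard form in which the even and odd cases are usually stated, and sums it over its range $2 \le d \le n_1 + n_2$; the function and variable names are ours, and the sample sizes $n_1 = n_2 = 10$ are chosen only for illustration.

```python
from math import comb


def run_pmf(d, n1, n2):
    """Probability of exactly d runs under the null hypothesis, for 2 <= d <= n1 + n2."""
    total = comb(n1 + n2, n1)              # number of distinct interleavings
    if d % 2 == 0:
        k = d // 2 - 1
        return 2 * comb(n1 - 1, k) * comb(n2 - 1, k) / total
    hi, lo = (d - 1) // 2, (d - 3) // 2
    return (comb(n1 - 1, hi) * comb(n2 - 1, lo)
            + comb(n1 - 1, lo) * comb(n2 - 1, hi)) / total


n1 = n2 = 10
pmf = {d: run_pmf(d, n1, n2) for d in range(2, n1 + n2 + 1)}
total_probability = sum(pmf.values())
exact_mean = sum(d * p for d, p in pmf.items())
approx_mean = 2 * n1 * n2 / (n1 + n2)
prob_at_most_6 = sum(p for d, p in pmf.items() if d <= 6)

print(f"sum of pmf = {total_probability:.6f}")
print(f"exact mean = {exact_mean:.3f}, approximate mean = {approx_mean:.3f}")
print(f"P[D <= 6] = {prob_at_most_6:.4f}")
```

For sample sizes this small the exact mean differs from the approximate mean by one run, which is a reminder that the Normal approximation is intended for larger $n_1$ and $n_2$.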











Example 7.5-8
(run test on sameness of two populations) We request two sets of ten random numbers from
RANDOM.ORG from a population $N(1,1)$ and order these by size as


$N(1,1) \to \{y^{(1)}\}$: $-1.4, -0.33, 0.40, 0.44, 0.70, 0.74, 1.3, 1.3, 1.7, 2.4$
$N(1,1) \to \{y^{(2)}\}$: $-0.67, -0.21, 0.38, 0.38, 0.51, 0.71, 1.4, 1.5, 2.0, 2.9$.


For calibration we co-join these two sequences into a single sequence and order the elements
of the sequence by size. We find that the realization of $D$ is 12. We then request a set of
random numbers from an “unknown” Normal distribution and order these by size as


$\{y^{(3)}\}$: $-3.8, -2.5, -0.13, 2.2, 2.8, 3.0, 3.8, 4.6, 5.5, 5.8$.

After interleaving these by size with the $\{y^{(1)}\}$ sequence and counting the runs, we get
$D' = 6$. We wish to test the hypothesis that the $\{y^{(1)}\}$ and $\{y^{(3)}\}$ sequences come from the
same population at the 0.05 level of significance. We solve

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z_{0.05}}\exp\left(-\frac{1}{2}z^2\right)dz = 0.05$$

and find that $z_{0.05} = -1.65$. For the given sample sizes we find that $\mu_D = 10$, $\sigma_D = \sqrt{5}$.
Thus, $d_0 = \sigma_D z_{0.05} + \mu_D = 6.3$ and since $D' < d_0$ (barely), we reject the hypothesis that
$\{y^{(3)}\}$ comes from a $N(1,1)$ population. Indeed, in this case, the $\{y^{(3)}\}$ sequence comes
from a $N(1,3)$ population.
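The threshold $d_0$ follows directly from the approximate mean and variance of $D$. The sketch below, with a helper name of our own, computes it for arbitrary sample sizes and significance level using the inverse Normal CDF from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist


def runs_threshold(n1, n2, alpha):
    """Return (d0, mu_D, sigma_D) from the Normal approximation to the distribution of D."""
    mu_d = 2 * n1 * n2 / (n1 + n2)
    sigma_d = sqrt(4 * n1**2 * n2**2 / (n1 + n2) ** 3)
    z_alpha = NormalDist().inv_cdf(alpha)   # negative for alpha < 0.5
    return mu_d + sigma_d * z_alpha, mu_d, sigma_d


d0, mu_d, sigma_d = runs_threshold(10, 10, 0.05)
print(f"mu_D = {mu_d:.1f}, sigma_D = {sigma_d:.3f}, d0 = {d0:.2f}")
```

An observed number of runs below the returned $d_0$ falls in the rejection region at the chosen level.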


Ranking Test for Sameness of Two Populations 


Another procedure for testing the sameness of two populations is the so-called ranking test.
Assume that we have two continuous populations $X$ and $Y$ with respective distribution
functions $F_X(x)$ and $F_Y(y)$. We wish to test the hypothesis $H_1 : F_X = F_Y$ versus the
alternative $H_2 : F_X \neq F_Y$. We take $n_1$ samples from $X$ and $n_2$ from $Y$, co-join them, and order
them by size. Then we assign to each element of the sequence a number denoting its place
in the ascending order; for example, the event $X_1 < Y_1 < X_2 < X_3 < Y_2 < Y_3 < Y_4$ would
be designated as

    X_1   Y_1   X_2   X_3   Y_2   Y_3   Y_4
     1     2     3     4     5     6     7


The number associated with each element is its rank, and the $Y$ sequence has ranks 2, 5,
6, and 7. Here $n_1 = 3$, $n_2 = 4$. The rank of the last element in the sequence is $n_1 + n_2$ and
the rank of the first is 1. It is shown elsewhere that the RV

$$T \triangleq \sum_{Y \text{ sequence}} \text{ranks}$$

is a suitable test statistic to test the hypothesis that $F_X(x) = F_Y(x)$, for all $x$. If $T$ is too
large or too small, the hypothesis is rejected. To test the hypothesis at a level of significance
$\alpha$, we need the distribution of $T$ under the null hypothesis. It is shown elsewhere ([7-22] to
[7-24]) that when $n_1 > 7$, $n_2 > 7$ (ideally we would want them larger), $T$ is approximately
distributed as $N(\mu_T, \sigma_T^2)$ with $\mu_T = n_2(n_1 + n_2 + 1)/2$, $\sigma_T^2 = n_1n_2(n_1 + n_2 + 1)/12$. In the
example above we find $\mu_T = 16$, $\sigma_T^2 = 8$.
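The rank sum and its moments are simple to compute. In the sketch below the numerical values assigned to the seven observations are hypothetical, chosen only so that their ordering reproduces the small example above; the function names are ours, and ties between the samples are assumed not to occur, consistent with continuous populations.

```python
def rank_sum(sample_x, sample_y):
    """Sum of the ranks of the Y elements in the combined ascending sequence."""
    labeled = sorted([(v, "x") for v in sample_x] + [(v, "y") for v in sample_y])
    return sum(rank for rank, (_, label) in enumerate(labeled, start=1) if label == "y")


def rank_sum_moments(n1, n2):
    """Mean and variance of T under the hypothesis that the populations are the same."""
    return n2 * (n1 + n2 + 1) / 2, n1 * n2 * (n1 + n2 + 1) / 12


# Hypothetical values realizing X1 < Y1 < X2 < X3 < Y2 < Y3 < Y4.
xs = [0.1, 0.3, 0.4]
ys = [0.2, 0.5, 0.6, 0.7]
t = rank_sum(xs, ys)
mu_t, var_t = rank_sum_moments(len(xs), len(ys))
print(f"T = {t}, mu_T = {mu_t}, sigma_T^2 = {var_t}")
```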


Example 7.5-9
(ranking test on sameness of two populations) We use the $\{y^{(1)}\}$ and $\{y^{(3)}\}$ sequences of
Example 7.5-8, co-join them, and assign ranks to the elements of the ascending sequence.
For the elements of the $\{y^{(3)}\}$ sequence, the ranks are 1, 2, 5, 13, 15, 16, 17, 18, 19, and 20;
their sum is 126, $\mu_T = 105$, and $\sigma_T = 13.23$. The hypothesis is that the two sequences come
from the same population. At a level of significance $\alpha = 0.05$, we solve $F_{SN}(z_{0.025}) = 0.025$
and get $z_{0.025} = -1.96$ so that the critical region is $\{T > 131\} \cup \{T < 79\}$. So we accept
the hypothesis, in error, that the two sequences come from the same population. At a
significance level $\alpha = 0.1$ the critical region is $\{T > 127\} \cup \{T < 83\}$. Marginally above
$\alpha = 0.1$, the hypothesis is rejected.
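The numbers in this example can be reproduced from the data of Example 7.5-8. The following sketch recomputes the rank sum and the two-sided acceptance limits from the Normal approximation; the function names are ours.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum(sample_x, sample_y):
    """Sum of the ranks of the Y elements in the combined ascending sequence."""
    labeled = sorted([(v, "x") for v in sample_x] + [(v, "y") for v in sample_y])
    return sum(rank for rank, (_, label) in enumerate(labeled, start=1) if label == "y")


def acceptance_limits(n1, n2, alpha):
    """Two-sided limits for T; values of T outside (low, high) fall in the critical region."""
    mu_t = n2 * (n1 + n2 + 1) / 2
    sigma_t = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mu_t - z * sigma_t, mu_t + z * sigma_t


y1 = [-1.4, -0.33, 0.40, 0.44, 0.70, 0.74, 1.3, 1.3, 1.7, 2.4]
y3 = [-3.8, -2.5, -0.13, 2.2, 2.8, 3.0, 3.8, 4.6, 5.5, 5.8]
t = rank_sum(y1, y3)
low_05, high_05 = acceptance_limits(len(y1), len(y3), 0.05)
low_10, high_10 = acceptance_limits(len(y1), len(y3), 0.10)
print(f"T = {t}")
print(f"alpha = 0.05: accept if {low_05:.2f} < T < {high_05:.2f}")
print(f"alpha = 0.10: accept if {low_10:.2f} < T < {high_10:.2f}")
```

At $\alpha = 0.1$ the observed rank sum lies just inside the upper limit, which is why a significance level only marginally above 0.1 is enough to reject.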





SUMMARY 


Hypothesis testing is a major branch of statistics that deals with decision making in a 
random (i.e., probabilistic) environment. In the beginning of this chapter we put ourselves 
in the mind of a surgeon who faced a difficult decision regarding whether to operate on one 
of his patients. By using all available prior information and seeking to minimize the average 
risk, we derived the Bayes decision rule, which—arguably—is the most rational approach to 
making decisions when available information is of the probabilistic kind rather than being 
categorical. The Bayes decision rule leads to a likelihood ratio test (LRT). 

The prior probabilities (sometimes called a priori probabilities) required in Bayes testing 
may not always be available in which case the threshold in the LRT for accepting/rejecting 
the hypothesis is determined not by minimizing the average risk but by the specified error 
probability a, which is the probability of rejecting the hypothesis based on observational 
data when in fact the hypothesis is true. In the case of testing a simple hypothesis versus a 
simple alternative, the Neyman—Pearson Theorem ensures that the LRT is optimum in that 
it is the most powerful test. By this is meant that the probability of rejecting the alternative 
hypothesis when it is true is driven to a minimum. 

In a number of situations, testing a simple hypothesis versus a simple alternative won’t 
do because the hypothesis or the alternative or both involve many outcomes in the under- 
lying sample space. In that case the generalized likelihood ratio test (GLRT) is useful. We 
illustrated the GLRT with a number of examples and, in doing so, encountered such classic 
statistical tests as the F-test, the t-test, and the Pearson Chi-square test. 

We then considered ordering, percentiles, and rank and illustrated how these tools can 
be made useful in distribution-free (sometimes called robust) statistics. We illustrated these 
with hypothesis testing examples using run tests and ranking tests. 


PROBLEMS 


7.1 Prove Equation 7.1-6. 

7.2 Consider Example 7.1-1. Let the prior probabilities be $P_1 = 0.9$, $P_2 = 0.1$. How does
this affect the Bayes decision rule?

7.3 Assume a Normal population $X : N(\mu, 1)$ and a sequence of i.i.d. observations on $X$,
that is, $\{X_i : i = 1, \ldots, n\}$. Find the critical region for testing the hypothesis that
$H_1 : \mu = \mu_1$ versus the alternative $H_2 : \mu > \mu_1$ at the 0.05 level.

7.4 Show that the power P of an LRT is given by P = Plreject Hı|Hz is true]. 

7.5 Why was it not necessary to invoke the Central Limit Theorem to argue that ñ y(n) 
in Example 7.2-2 is Normally distributed? 
7.6 We flip a coin 100 times and observe $50 + k$ heads and $50 - k$ tails. What is the
largest value of $k$ that will enable us to accept the hypothesis that the coin is fair at
$\alpha = 0.05$ significance? Repeat for $\alpha = 0.01$.

7.7 A customer in a sub-freezing environment is considering buying an automobile battery
at DBW (“Discount Battery Warehouse”). The particular battery model of interest
is imported from one of two possible sources, say A and B, which do not share the
same quality-control standards. The better import (A) will start the car 90 percent
of the time in sub-freezing weather while the worse import (B) will start the car only
50 percent of the time in such weather. There are equal numbers of batteries from
each source. The imports cannot be differentiated by any external visible features.
The battery salesman will allow the customer only one try at starting his car with a
test battery, under sub-freezing conditions, before purchase.

We shall treat the customer’s dilemma, such as it is, from a hypothesis testing point
of view. Let the hypothesis be $H_1$: the battery-start probability $p_1 = 0.9$ versus the
alternative $H_2$: the battery-start probability $p_2 = 0.5$. There are two actions: $a_1$ (buy
the battery) and $a_2$ (reject the battery). The loss functions are in dollars: $l(a_1, p_1) = 0$;
$l(a_1, p_2) = 40$ (money spent on a poor battery); $l(a_2, p_1) = 10$ (passing up a good
deal that would cost at least \$10 elsewhere); $l(a_2, p_2) = 0$. Define the RV $X$ as

$$X \triangleq \begin{cases} 1, & \text{if battery starts the car in test trial}, \\ 0, & \text{if battery fails to start car in test trial}. \end{cases}$$

(a) Define the four possible decision functions ($d_i$, $i = 1, \ldots, 4$);

(b) Compute the risk for each decision function ($R(d_i; p_j)$, $i = 1, \ldots, 4$; $j = 1, 2$);

(c) Plot the risk function points in a Cartesian system where the abscissa is
$R(d; p_1)$ and the ordinate is $R(d; p_2)$. From the graph, determine which decision
function is dominated (is worse) by at least one other decision function
and therefore is inadmissible (not worthy of consideration).

(d) Suppose it is known that there are twice as many batteries from import B as
from A; how would this affect your decision?

7.8 Suppose a manufacturer of memory chips observes that the probability of chip failure
is $p = 0.05$. A new procedure is introduced to improve the design of chips. To test this
new procedure, 200 chips could be produced using this new procedure and tested. Let
the random variable $X$ denote the number of chips that fail out of these 200. We set
the test rule that we would accept the new procedure if $X < 5$. Find the probability
of a type I error.

7.9 Let $X : N(\mu, 1)$, where $\mu = \mu_1 = 1/2$ or $\mu = \mu_2 = -1/2$. Let $H_1 : \mu = -1/2$ and
$H_2 : \mu = 1/2$. Define the two actions $a_1$ : accept $H_1$ (reject $H_2$) and $a_2$ : accept $H_2$
(reject $H_1$). The sample space for $X$ is $\Omega = (-\infty, \infty)$. Let $S_1 = (-\infty, 0)$ and
$S_2 = (0, \infty)$. Consider the two mutually exclusive events $E_1 = \{X \in S_1\}$ and
$E_2 = \{X \in S_2\}$.
(a) Compute the four probabilities $P(E_i | \mu_j)$, $i = 1, 2$; $j = 1, 2$;
(b) Define the four possible decision functions $d_i$, $i = 1, \ldots, 4$;
(c) Assuming the loss functions $l(a_1, \mu_1) = 0$, $l(a_1, \mu_2) = 2$, $l(a_2, \mu_1) = 5$, $l(a_2, \mu_2) = 0$,
compute the risks associated with each of the decision functions in (b).
Which decision function is inadmissible, that is, there is at least one other
decision function that dominates (is better than) it?

We have two Normal populations $X_1 : N(\mu_1, \sigma^2)$ and $X_2 : N(\mu_2, \sigma^2)$. We test $H_1 : \mu_1 = \mu_2$
versus $H_2 : \mu_1 \neq \mu_2$ at a level of significance of 5 percent. Describe the test.

We have two Normal populations $X_1 : N(\mu_1, \sigma^2)$ and $X_2 : N(\mu_2, \sigma^2)$. We test $H_1 : \mu_1 = \mu_2$
versus $H_2 : \mu_1 > \mu_2$ at a level of significance of 5 percent. Describe the test.

Repeat Problem 7.11 with the change that $H_1 : \mu_1 = \mu_2$ versus $H_2 : \mu_1 < \mu_2$.

Suppose that we have $n$ observations $X_i$, $i = 1, 2, \ldots, n$, of radar signals, and the $X_i$ are
normal, independent and identically distributed random variables. Under $H_0$, the $X_i$
have mean $\mu_0$ and variance $\sigma^2$, while under $H_1$, the $X_i$ have mean $\mu_1$ and variance $\sigma^2$,
and $\mu_1 > \mu_0$. Determine the maximum likelihood test.

7.14 A manufacturer is interested in the output voltage of a power supply used in a
personal computer. The output voltage is assumed to be normally distributed with
a standard deviation of 0.25 volts, and the manufacturer wishes to test $H_0 : \mu = 5$
against $H_1 : \mu \neq 5$ using $n = 16$ units.

(a) The acceptance region is $4.85 \le \bar{X} \le 5.15$. Find the type I error.
(b) Find the power of the test for detecting a true mean output voltage of
5.1 volts.

7.15 Let $X : N(\mu, 1)$ represent a population whose mean is known to be $\mu = \mu_1 = 3$ or
$\mu = \mu_2 = 1$. We make $n$ i.i.d. observations on $X$ and call these $\{X_i, i = 1, \ldots, n\}$.
Let $H_1 : \mu = \mu_1 = 3$ and $H_2 : \mu = \mu_2 = 1$; show that the LRT is reduced to accept
$H_1$ if $\hat{\mu} > (2n)^{-1} \ln(k) + 2 \triangleq c_n$, where, as usual, $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$. The constant
$c_n$ is determined by the significance $\alpha$. Find a general expression for $c_n$ in terms of
$\mu_1$, $n$, and $z_\alpha$, the latter being the $\alpha$ percentile of the $N(0, 1)$ distribution. Assuming
$n = 10$, what is the value of $c_n$ for $\alpha = 0.01$?

7.16 (continuation of Problem 7.15) In Problem 7.15 treat $n$ as an unknown and calculate
the value of $n$ needed to obtain $\alpha = 0.02$ and $\beta = 0.01$ simultaneously.

7.17 (continuation of Problem 7.16) Keeping $\alpha$ at $\alpha = 0.02$, show that the number of
samples needed to achieve a given power follows the graph below. (Hint: Use
NORMINV(probability, mean, standard deviation) in Excel™.)


[Figure: Number of samples versus power. Horizontal axis: Power = $1 - \beta$.]

(F-test for comparing variances) The F-test is useful in testing whether the variances
(or standard deviations) of two Normal populations are the same. Typically we test
the hypothesis $H_1 : \sigma_1 = \sigma_2$ versus $H_2 : \sigma_1 \neq \sigma_2$. The F-test can be done online by
entering the data from two Normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ and taking
the ratio of the sample variances. Thus, assume we have $m$ samples from population
P1, $\{X_{1i}, i = 1, \ldots, m\}$, and $n$ samples from population P2, $\{X_{2j}, j = 1, \ldots, n\}$.
We do not mix the samples because it is important to keep the sample variances
independent of each other. One of several programs will compute from the input
realizations $\{x_{1i}, i = 1, \ldots, m\}$ and $\{x_{2j}, j = 1, \ldots, n\}$ the numerical sample variances,
often denoted by the symbols $s_1^2$ and $s_2^2$, as

$$s_1^2 \triangleq (m-1)^{-1} \sum_{i=1}^{m} (x_{1i} - \bar{x}_1)^2 \quad (\text{degrees of freedom DOF} = m - 1)$$

and

$$s_2^2 \triangleq (n-1)^{-1} \sum_{j=1}^{n} (x_{2j} - \bar{x}_2)^2 \quad (\text{degrees of freedom DOF} = n - 1).$$

In these expressions $\bar{x}_1 = m^{-1} \sum_{i=1}^{m} x_{1i}$ and $\bar{x}_2 = n^{-1} \sum_{j=1}^{n} x_{2j}$
are the sample numerical means. We need to specify the significance level $\alpha$. The
algorithm then proceeds as follows: (1) compute $F' = s_1^2 / s_2^2$; (2) compare $F'$ with
$F_{\alpha/2, \nu_1, \nu_2}$, where $F_{\alpha/2, \nu_1, \nu_2}$ is the critical value of the F-distribution with
$\nu_1 = m - 1$ and $\nu_2 = n - 1$ degrees of freedom and significance $\alpha$.

When testing $H_1 : \sigma_1 = \sigma_2$ versus $H_2 : \sigma_1 > \sigma_2$, reject $H_1$ if $F' > F_{\alpha, \nu_1, \nu_2}$.

When testing $H_1 : \sigma_1 = \sigma_2$ versus $H_2 : \sigma_1 < \sigma_2$, reject $H_1$ if $F' < F_{1-\alpha, \nu_1, \nu_2}$.

When testing $H_1 : \sigma_1 = \sigma_2$ versus $H_2 : \sigma_1 \neq \sigma_2$, reject $H_1$ if $F' < F_{1-\alpha/2, \nu_1, \nu_2}$ or
$F' > F_{\alpha/2, \nu_1, \nu_2}$.

As an exercise, generate two sets of Gaussian random numbers, first with the same
$\sigma$ and then with different $\sigma$'s, and test the efficacy of the F-test using an online
calculator, for example, the BioKin statistical calculator.

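The two-sided procedure described in this problem can also be tried offline. The following is a minimal Python sketch, not the method of any particular online calculator. It assumes only the standard library, so the critical values of the F-distribution are estimated by simulation under equal variances rather than read from a table; the function names, sample sizes, seeds, and the values of $\sigma$ are illustrative assumptions.

```python
import random


def sample_variance(data):
    """Unbiased sample variance s^2 with len(data) - 1 degrees of freedom."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)


def f_statistic(sample1, sample2):
    """Test statistic F' = s1^2 / s2^2."""
    return sample_variance(sample1) / sample_variance(sample2)


def simulated_critical_values(m, n, alpha, trials=5000, seed=1):
    """Estimate the lower and upper two-sided critical values of F' by
    simulating the statistic when both populations have the same variance."""
    rng = random.Random(seed)
    ratios = sorted(
        f_statistic([rng.gauss(0.0, 1.0) for _ in range(m)],
                    [rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    lower = ratios[int((alpha / 2) * trials)]
    upper = ratios[int((1 - alpha / 2) * trials) - 1]
    return lower, upper


def two_sided_f_test(sample1, sample2, alpha=0.05):
    """Return (F', reject) for H1: sigma1 = sigma2 versus H2: sigma1 != sigma2."""
    lower, upper = simulated_critical_values(len(sample1), len(sample2), alpha)
    f_value = f_statistic(sample1, sample2)
    return f_value, (f_value < lower or f_value > upper)


rng = random.Random(7)
# Same sigma (means may differ; the F-test only looks at the variances).
same = two_sided_f_test([rng.gauss(5.0, 2.0) for _ in range(30)],
                        [rng.gauss(1.0, 2.0) for _ in range(30)])
# Different sigmas.
different = two_sided_f_test([rng.gauss(5.0, 6.0) for _ in range(30)],
                             [rng.gauss(1.0, 2.0) for _ in range(30)])
print("same sigma:      F' = %.3f, reject = %s" % same)
print("different sigma: F' = %.3f, reject = %s" % different)
```

Repeating the experiment over many seeds gives an empirical estimate of how often the test rejects in each case, which is one way to judge its efficacy.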
The number of defects in printed circuit boards is hypothesized to follow a Poisson 
distribution. A random sample of n = 60 boards has been collected and the following 
number of defects observed. 


No. of defects    Observed frequency
0                 32
1                 15
2                 9
3                 4

Test the goodness of fit. 

A semi-conductor manufacturer produces controllers used in automobile engine
applications. The customer requires that the process fallout or fraction defective at a
critical manufacturing step does not exceed 0.05, and that the manufacturer demonstrate
process capability at this level of quality using $\alpha = 0.05$. The semi-conductor
manufacturer takes a random sample of 200 devices and finds that 4 of them are defective.
Can the manufacturer demonstrate process capability for this customer?

7.21 (F-test) We are given the following factual data from [7-6] that tests the oxygen
assimilating capability of various levels of smokers versus nonsmokers. There are five
categories:

Category                                     Mean respiratory   Standard deviation   Number of people
                                             flow rate          of flow rate         in category
Nonsmokers in smoke-free environment (1)     3.17               0.74                 200
Nonsmokers in smoky environment (2)          2.72               0.71                 200
Light smokers (3)                            2.63               0.73                 200
Moderate smokers (4)                         2.29               0.70                 200
Heavy smokers (5)                            2.19               0.72                 200

The hypothesis $H_1$ is that there is no difference in air flow among the five categories;
the alternative is that there is at least one category whose respiratory statistics are
significantly different from the others.†

Compute whether to accept or reject the hypothesis at the 0.05 significance level.

7.22 (Chi-square test) Plant biologists attempt to test Mendel's law of heredity by
crossing two pea plants. According to Mendel's law three-fourths of the offspring
should be green (dominant color) and one-fourth should be yellow (recessive). In
880 plants, the biologists observe 639 green seeds and 241 yellow seeds. Let $H_1$: green
allele‡ is dominant and $H_2$: green allele is not dominant. Determine at the 0.05 level
of significance whether to accept or reject the hypothesis.

7.23 Let $(X_1, X_2, \ldots, X_n)$ be a random sample of a normal random variable with mean $\mu$
and variance 100. Let

$$H_0 : \mu = 50$$
$$H_1 : \mu = \mu_1 \ (> 50)$$

and sample size $n = 25$. As a decision procedure, we use the rule to reject $H_0$ if $\bar{x} > 52$,
where $\bar{x}$ is the value of the sample mean $\bar{X}$.

(a) Find the probability of rejecting $H_0 : \mu = 50$ as a function of $\mu$ $(> 50)$.

(b) Find the probability $\alpha$ of a type I error.

(c) Find the probability $\beta$ of a type II error when (i) $\mu_1 = 53$ and (ii) $\mu_1 = 55$.


†Note that if we reject the hypothesis, we still won't know which category (or categories) was responsible
for the rejection.
‡A gene transferring inherited characteristics.
7.24
7.25 


7.26 


7.27 


7.28 


7.29 


7.31 


7.32 
7.33 


7.34 


Show that the statistic $V$ in the Pearson goodness-of-fit test has expectation $E[V \mid H_1] =
l - 1$ under hypothesis $H_1$.

Show that the statistic $V$ in the Pearson goodness-of-fit test has expectation $E[V \mid H_2] >
l - 1$ under the alternative $H_2$.

Consider the F-test in testing for the equality of two variances. Plot the test statistic
versus the variance ratio for $m = 8$, $n = 5$. Find the critical region for significance of
0.05.

In testing the equality of two variances of two Normal populations with $m$ samples
from population P1 and $n$ samples from population P2, show that when $H_1$ is true
$\Lambda$ can be written as

$$\Lambda = A(m,n) \, \frac{\left( \frac{m-1}{n-1} F_{m-1,n-1} \right)^{m/2}}{\left( 1 + \frac{m-1}{n-1} F_{m-1,n-1} \right)^{(m+n)/2}},$$

where $A(m,n) \triangleq (m+n)^{(m+n)/2} \, m^{-m/2} \, n^{-n/2}$.

Aircrew escape systems are powered by a solid propellant. Specifications require that
the mean burning rate must be 50 cm per second. The standard deviation of burning
rate is $\sigma = 2$ cm per second. A random sample of $n = 25$ is obtained and the sample
burning rate $\bar{x} = 51.3$ cm per second is calculated. What conclusion should be drawn
at a significance level of $\alpha = 0.05$?

A melting point test of $n = 10$ samples of a binder used in manufacturing a rocket
propellant resulted in $\bar{x} = 154.2$°F. Assume that the melting point is normally
distributed with $\sigma = 1.5$°F.

(a) Test $H_0 : \mu = 155$ versus $H_1 : \mu \neq 155$ using $\alpha = 0.01$.
(b) Calculate the power of the test if the true mean is $\mu = 150$.


Twenty-four observations are made on a random variable $X$ and are ordered by size
as $Y_1 < Y_2 < \cdots < Y_{24}$. Estimate the 30th percentile.

Find a 98 percent confidence interval for the median from 25 samples.

The mean lifetime of a sample of 100 lightbulbs produced by Lighting Corporation is
computed to be 1570 hours with a standard deviation of 120 hours. If the president
of the company claims that the mean lifetime $E[X]$ of all the lightbulbs produced by
the company is 1600 hours, test the hypothesis that $E[X]$ is not equal to 1600 hours
using a level of significance of (a) 0.05 and (b) 0.01.

A manufacturer of a migraine headache drug claimed that the drug is 90% effective 
in relieving migraines for a period of 24 hours. In a sample of 200 people who have 
migraine headaches, the drug provided relief for 160 people for a period of 24 hours. 
Determine whether the manufacturer’s claim is legitimate at a level of significance 
of 0.05. 
8 Random Sequences


Random sequences are used as models of sampled data arising in signal and image processing, 
digital control, and communications. They also arise as inherently discrete data such as 
economic variables, the content of a register in a digital computer, something as simple as 
coin flipping (Bernoulli trials), or the number of packets on a link in a computer network. 
In each case, the random sequence models the unpredictable behavior of these sources from 
the user’s perspective. In this chapter we will study the random sequence and some of its 
important properties. As we will see, a random (stochastic) sequence can be thought of
as an infinite dimensional vector of random variables.† As such it stands between finite
dimensional random vectors (cf. Chapter 5) and continuous-time random functions, called
random processes, to be studied in the next chapter.

Another way to generalize the random vector is by doubling the number of index
parameters to two, thereby creating random matrices, which have been found useful as
mathematical models in image processing. When these random matrices grow in size, in the
infinite limit we have a two-dimensional random sequence, used in many theoretical studies
in image and geophysical signal processing. While we will not study image processing here,
many of the basic concepts of random sequences carry over to the two-dimensional case.
Three- and four-dimensional random sequences have been found useful models of
unpredictable aspects in video and other spatiotemporal signals.


†In the real world all sequences are finite. However, as long as the real-world sequences are long compared
to internal correlations, the infinite length model does not significantly detract from accuracy except when
we are at the very beginning or end of the real-world sequence.
In the course of developing this material we will need to review and extend some of
the basic material presented in Chapter 1 on the axioms of probability. This is because we 
must now routinely deal with an infinite number of random variables at one time, that is, a 
random sequence. We start out this study by offering a definition of the random sequence 
followed by a few simple examples. 


Definition 8.1-1 Let $(\Omega, \mathscr{F}, P)$ be a probability space. Let $\zeta \in \Omega$. Let $X[n, \zeta]$ be a
mapping of the sample space $\Omega$ into a space of complex-valued sequences on some index set
$Z$. If, for each fixed integer $n \in Z$, $X[n, \zeta]$ is a random variable, then $X[n, \zeta]$ is a random
(stochastic) sequence. The index set $Z$ is all the integers, $-\infty < n < +\infty$, padded with
zeros if necessary. ■


See Figure 8.1-1 for an illustration for sample space $\Omega = \{1, \ldots, 10\}$. We see that
$X[n, \zeta]$ for a fixed outcome $\zeta$ is an ordinary sequence of numbers, that is, a deterministic
(nonrandom) function of the discrete parameter $n$. We often refer to these ordinary
sequences as realizations of the random sequence, or as sample sequences, and denote them by
$X_\zeta[n]$ or merely by $x[n]$ when there is no confusion. Thus, ten sample sequences are plotted
in Figure 8.1-1, one for each outcome $\zeta \in \Omega$. On the other hand, for $n$ fixed and $\zeta$ variable,
$X[n, \zeta]$ is a random variable.† Thus the collection of all these realizations, $-\infty < n < +\infty$,
along with the probability space, is the random sequence. We shall often, but not always,
denote the random sequence by just $X[n]$. We retain the notation $X[n, \zeta]$ when its use
helps to clarify a point on the outcomes $\zeta$ of the underlying sample space $\Omega$. Note that we
use square brackets around the time argument $n$ here, as is the convention in discrete-time
signal processing.

Figure 8.1-1 Illustration of the concept of random sequence $X[n, \zeta]$, where the $\zeta$ domain (i.e., the
sample space $\Omega$) consists of just ten values. (Samples connected only for plot.)

†Elementary probability texts talk about an i.i.d. sequence of RVs denoted by $X_n(\zeta)$. Our random
sequence, however, allows the added complication of dependence among these RVs.

We give the following simple examples of random sequences: 


Example 8.1-1
(separable random sequence) Let $X[n, \zeta] \triangleq X(\zeta) f[n]$, where $X(\zeta)$ is a random variable and
$f[n]$ is a given deterministic (ordinary) sequence. Such a random sequence is the separable
product of a random variable (function) and an ordinary sequence. We will also write
$X[n] = X f[n]$, suppressing the outcome $\zeta$ variable, as is the custom for random variables.
We see that all the sample sequences are just scaled versions of one another, with the scalar
being the random variable $X$.


Example 8.1-2
(sinusoid with random amplitude and phase) Let $X[n, \zeta] \triangleq A(\zeta) \sin(\pi n/10 + \Theta(\zeta))$, where
$A$ and $\Theta$ are random variables defined on a common probability space $(\Omega, \mathscr{F}, P)$, alternately
written $X[n] = A \sin(\pi n/10 + \Theta)$.





These two simple random sequences are made from deterministic components, but they
are also "deterministic" in another way. They have the unusual property, from a
probabilistic standpoint, that their future values are exactly determined from their present and
past values. In Example 8.1-1, once we observe $X[n]$ at any fixed value of $n$, say $n = 0$, then,
since the ordinary sequence $f[n]$ is assumed to be known and nonrandom, all of the random
sequence $X[n]$ becomes known. We see that the random sequence $X[n]$ is conditionally
known given its value at $n = 0$. The situation in Example 8.1-2 is just slightly more
complicated but the same approach suffices to show that given two (nondegenerate) observations,
say at $n = 0$ and $n = 5$, one can determine the values taken on by the random variables $A$
and $\Theta$; then the sequence $X[n]$ becomes conditionally known or perfectly predictable given
these observations at $n = 0$ and $n = 5$. These deterministic random sequences would not be
good models for noise on a communications channel because real noise is not so easily foiled.
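
The perfect predictability claimed for Example 8.1-2 can be checked numerically. Since $X[0] = A \sin\Theta$ and $X[5] = A \sin(\pi/2 + \Theta) = A \cos\Theta$, the two observations give $A$ and $\Theta$ directly. The Python sketch below is an illustration under the added assumptions that $A > 0$ and $-\pi < \Theta \le \pi$; the function names and numerical values are hypothetical.

```python
import math


def observe(amplitude, phase, n):
    """Sample sequence value x[n] = A sin(pi n / 10 + Theta) for one outcome."""
    return amplitude * math.sin(math.pi * n / 10 + phase)


def recover_amplitude_phase(x0, x5):
    """Recover (A, Theta) from x[0] = A sin(Theta) and x[5] = A cos(Theta).

    Assumes A > 0 and -pi < Theta <= pi (an assumption of this sketch).
    """
    amplitude = math.hypot(x0, x5)
    phase = math.atan2(x0, x5)
    return amplitude, phase


# One realization of the random variables (values chosen for illustration).
true_a, true_theta = 2.0, 0.7
a_hat, theta_hat = recover_amplitude_phase(observe(true_a, true_theta, 0),
                                           observe(true_a, true_theta, 5))

# Any other sample of the same realization is now predictable.
print(observe(a_hat, theta_hat, 17), observe(true_a, true_theta, 17))
```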

In the next example we see how a more general but still “deterministic” random 
sequence can be made out of a random vector. 


Example 8.1-3
(random sequence with finite support) Let $X[n, \zeta]$ be given by

$$X[n, \zeta] \triangleq \begin{cases} X_n(\zeta), & 1 \le n \le N, \\ 0, & \text{else}. \end{cases}$$

Since $X[n] = 0$ except for $n \in [1, N]$, we say $X[n]$ has finite support. Because of this
finite support property, we can model this random sequence by a random vector $\mathbf{X} =
(X_1, X_2, \ldots, X_N)^T$ and then use the rich calculus of matrix algebra, for example, covariance
matrices and linear transformations, as presented in Chapter 5. Many random sequences
can be approximated this way, although note that we would have to consider the limiting
behavior of such $\mathbf{X}$, as $N \to \infty$, to model a general random sequence.
Figure 8.1-2 Tree diagram for discrete amplitude random sequence. (Horizontal axis: tree level $n = 0, 1, 2, 3$.)


Example 8.1-4
(tree diagram for random sequence) Let the random sequence $X[n]$ be defined over $n \ge 0$,
and take on only $M$ discrete values, $0, 1, 2, \ldots, M - 1$. Further assume the starting value is
pinned at $X[0] = 0$. Then we can illustrate the evolution of the sample sequences of this
random sequence with a tree diagram, with branching factor $M$ at each node $n = 0, 1, 2, \ldots$
as illustrated in Figure 8.1-2.

At each level $n$ of the tree, the node values give possible sample sequence values $x[n]$,
with branch index $i = 0, \ldots, M - 1$. The sample sequences are identified by the sequence of
node values of a path through the tree starting from the root node $n = 0$. If we identify the
path string $i_1 i_2 i_3 \ldots$ with the base-$M$ number $0.i_1 i_2 i_3 \ldots$,† we can call this point the outcome
$\zeta \in [0, 1] = \Omega$, the sample space. Finally we can label the branches with the conditional
probability $P[X[n] = i_n \mid \{X[k] \text{ on same path for } k \le n - 1\}]$, which in Figure 8.1-2 is
denoted as $P[i_n \mid i_{n-1} i_{n-2} \ldots i_1 0]$. Then the probability of any node value at tree level $n$ is
just given by the product of all the probability branch labels back to the root node along
this path. Note that all sample sequences that agree up to time $n$ will correspond to a
neighborhood in the sample space $\Omega = [0, 1]$ of radius $\frac{1}{2} M^{-n}$.
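
The bookkeeping in Example 8.1-4 can be made concrete with a short Python sketch. It multiplies branch labels along a path and maps the path string to its base-$M$ point in $[0, 1]$. The branch rule `sticky_branch` is a hypothetical choice made only for illustration (it favors repeating the previous value); the example itself does not specify the conditional probabilities.

```python
from itertools import product


def path_probability(path, branch_probability):
    """Product of the branch labels P[i_n | i_{n-1} ... i_1 0] along a path."""
    prob = 1.0
    history = (0,)  # the starting value is pinned at X[0] = 0
    for value in path:
        prob *= branch_probability(value, history)
        history += (value,)
    return prob


def outcome_point(path, M):
    """Map the path string i_1 i_2 ... to the base-M number 0.i_1 i_2 ..."""
    return sum(digit * M ** -(k + 1) for k, digit in enumerate(path))


def sticky_branch(value, history, M=3, stay=0.5):
    """Hypothetical branch label: repeat the last value with probability
    `stay`, otherwise move to one of the other M - 1 values uniformly."""
    return stay if value == history[-1] else (1 - stay) / (M - 1)


# Probability of the path X[1] = 0, X[2] = 1, X[3] = 1 with M = 3.
print(path_probability((0, 1, 1), sticky_branch))

# The path probabilities at any tree level must sum to one.
total = sum(path_probability(p, sticky_branch) for p in product(range(3), repeat=3))
print(total)

# Base-8 string "12" (M = 8) maps to the point 1/8 + 2/64.
print(outcome_point((1, 2), 8))
```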


This example also has shown how to construct a consistent underlying sample space in
the common case where we are given just the probability distribution information about the
set of random variables that make up the random sequence. Note that when the random
variables are all independent of one another, that is, jointly independent, and this probability
distribution doesn't change with time, the branch labels in the tree are all the same, and in
effect, the tree collapses to one stage. This is the situation called a sequence of i.i.d. random
variables in probability theory. Generalizing this slightly we have the following definition.


Definition 8.1-2 An independent random sequence is one whose random variables at
any times $n_1, n_2, \ldots, n_N$ are jointly independent for all positive integers $N$. ■


†For example let $M = 8$ and consider the base 8 number $0.1200\ldots0\ldots$. This implies that $X[1] =
1$, $X[2] = 2$, and all subsequent values are 0.
Figure 8.1-3 Example of a sample sequence of a random sequence. (Samples connected only for plot.)


Independent random sequences play a key role in our theory because they are relatively 
easy to analyze, they form the basis of more complicated and accurate models, and it is easy 
to get approximate sample sequences using random number generators on computers. Also 
when the discrete data arises by sampling continuous-time data, statistical independence 
often is a good approximation if the samples are far apart. 

Figure 8.1-3 shows a segment from a real noise sequence, and Figure 8.1-4 shows a close- 
up portion revealing its discrete-time nature and detailed “randomness.” This segment could 
have been taken from anywhere in the noise sequence and the statistical properties would 
have been the same. This remarkable property hints at some form of “stationarity” which 
will shortly be defined (Definition 8.1-5). Note that successive random variables, making 
up this segment, do not appear to be independent. Rather they are evidently correlated, 
necessitating in general an Nth-order probability distribution to statistically describe just 
this segment of this noise sequence. Continuing in this way, we would need an infinite-order 
CDF to characterize the whole random sequence! 

In order to deal with infinite length random sequences, we may have to be able to
compute the probabilities of infinite intersections† of events, for example, the event $\{X[n] <
5 \text{ for all positive } n\}$, which can be written as either $\bigcap_{n=1}^{\infty} \{X[n] < 5\}$ or, by De Morgan's
laws, in terms of the infinite union $\left( \bigcup_{n=1}^{\infty} \{X[n] \ge 5\} \right)^c$. This requires that we can define
and work with the probabilities of infinite collections of events, which presents a problem
with Axiom 3 of probability measure: That is, for $AB = \phi$, the null set,

$$P[A \cup B] = P[A] + P[B] \quad \text{(Axiom 3)}. \qquad (8.1\text{-}1)$$


†Please review Section 1.4 on the definition of infinite intersections and unions. The concept is simple
but often misunderstood.
Figure 8.1-4 Close-up view of portion of sample sequence.


By iteration we could build this result up to the result

$$P\left[ \bigcup_{n=1}^{N} A_n \right] = \sum_{n=1}^{N} P[A_n],$$

for any finite positive $N$, assuming $A_i A_j = \phi$ for all $i \neq j$. This is called finite additivity.
It will permit us to evaluate $\lim_{N \to \infty} P\left[ \bigcup_{n=1}^{N} A_n \right]$, but what we need above is $P\left[ \bigcup_{n=1}^{\infty} A_n \right]$,
where $A_n \triangleq \{X[n] \ge 5\}$. For general functions these two quantities might not be the same,
that is, $\lim_{n \to \infty} f(x_n) \neq f(\lim_{n \to \infty} x_n)$. For this interchange of limiting operations to
be valid, we need some kind of continuity built into probability measure $P$. This can be
achieved by augmenting or replacing Axiom 3 by the stronger infinitely (countably) additive
Axiom 4 given as


Axiom 4 (Countable Additivity)

$$P\left[ \bigcup_{n=1}^{\infty} A_n \right] = \sum_{n=1}^{\infty} P[A_n], \qquad (8.1\text{-}2)$$

for an infinite collection of events satisfying $A_i A_j = \phi$ for $i \neq j$. ■


Fortunately, in the branch of mathematics called measure theory [8-1] (see also 
Appendix D), it is shown that it is always possible to construct probability measures satis- 
fying the stronger Axiom 4. Moreover, if one has defined a probability measure P satisfying 
Axiom 3, that is, it is finitely additive, then the Russian mathematician Kolmogorov [8-2], 
often referred to as the father of modern probability, has shown that it is always possible
to extend the measure P to satisfy the countable additivity Axiom 4. We pause now for 
an example, after which we will show that Axiom 4 is equivalent to the desired continuity 
of the probability measure P. Henceforth, we will assume that our probability measures 
satisfy Axiom 4, and say they are countably additive. 


Infinite-length Bernoulli Trials 


Let $\Omega = \{H, T\}$, i.e., two outcomes $\zeta = H$ and $T$, with $P[H] = p$, with $0 < p < 1$, and
$P[T] = q \triangleq 1 - p$. Define the random variable $W$ by $W(H) \triangleq 1$ and $W(T) \triangleq 0$, indicative
of successes and failures in coin flipping.

Let $\Omega_n$ be the sample space on the $n$th flip (the $n$th copy of $\Omega$) and define a new
event space as the infinite cross product† $\boldsymbol{\Omega}_\infty \triangleq \times_{n=1}^{\infty} \Omega_n$. This would be the sample space
associated with an infinite sequence of flips, each with sample space $\Omega_n$. We then define
the random sequence $W[n, \boldsymbol{\zeta}] \triangleq W(\zeta_n)$, thus generating the Bernoulli random sequence
$W[n]$, $n \ge 1$. Here the outcome $\boldsymbol{\zeta}$ is given as the outcomes at the individual trials as
$\boldsymbol{\zeta} = (\zeta_1, \zeta_2, \ldots, \zeta_n, \ldots)$.

Consider the probability measure for the infinite dimensional sample space $\boldsymbol{\Omega}_\infty$. Letting
$A_n$ denote an event‡ at trial $n$, that is, $A_n \in \mathscr{F}_n$, where $\mathscr{F}_n$ is the field of events in the
probability space $(\Omega_n, \mathscr{F}_n, P)$ of trial $n$, we need to have $\times_{n=1}^{\infty} A_n$ as an event in $\mathscr{F}_\infty$, the
$\sigma$-field of events in $\boldsymbol{\Omega}_\infty$. To complete this field of events, we will have to augment it with
all the countable intersections and unions of such events. For example, we may want to
calculate the probability of the event

$$\{W[1] = 1, W[2] = 0\} \cup \{W[1] = 0, W[2] = 1\},$$

which can be interpreted as the union of two events of the form $\times_{n=1}^{\infty} A_n$; that is, $\{W[1] =
1, W[2] = 0\} = \times_{n=1}^{\infty} A_n$ with $A_1 = \{W[1] = 1\}$, $A_2 = \{W[2] = 0\}$, and $A_n = \Omega_n$ for
$n \ge 3$. Hence $\mathscr{F}_\infty$ must include all such events for completeness. To construct a probability
measure on $\boldsymbol{\Omega}_\infty$, we start with sets of the form $A_\infty = \times_{n=1}^{\infty} A_n$ and define in the case of
independent trials,

$$P_\infty[A_\infty] \triangleq \prod_{n=1}^{\infty} P[A_n].$$

We then extend this probability measure to all of $\mathscr{F}_\infty$ by using Axiom 4 and the fact that
every member of $\mathscr{F}_\infty$ is expressible as the countable union and intersection of events of the
form $\times_{n=1}^{\infty} A_n$. We have in principle thus constructed the probability space $(\boldsymbol{\Omega}_\infty, \mathscr{F}_\infty, P_\infty)$
corresponding to the infinite-length Bernoulli trials, with associated Bernoulli random
sequence

$$W[n, \boldsymbol{\zeta}] = W(\zeta_n), \quad n \ge 1.$$


†Here the infinite cross product $\times_{n=1}^{\infty} \Omega_n$ simply means that the points in $\boldsymbol{\Omega}_\infty$ consist of all the
infinite-length sequences of events, each one in $\Omega_n$ for some $n$. Thus if outcome $\boldsymbol{\zeta} \in \boldsymbol{\Omega}_\infty$, then
$\boldsymbol{\zeta} = (\zeta_1, \zeta_2, \zeta_3, \ldots)$, where outcome $\zeta_n$ is in $\Omega_n$ for each $n \ge 1$. (The finite-length case of
Bernoulli trials was treated in Section 1.9.)

‡Most likely just a singleton event, that is, just one outcome, in this binary case.
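
As a small illustration of the product measure, the probability of the event $\{W[1] = 1, W[2] = 0\} \cup \{W[1] = 0, W[2] = 1\}$ mentioned in this subsection can be computed directly: each term is a product over the constrained trials (unconstrained trials contribute a factor $P[\Omega_n] = 1$), and the two events are disjoint, so their probabilities add. The Python sketch below uses a hypothetical helper `cylinder_probability` and an illustrative value $p = 0.3$.

```python
def cylinder_probability(constraints, p):
    """Probability of the event fixing W[n] = value for each (n, value) in
    `constraints`; every unconstrained trial contributes a factor of 1."""
    prob = 1.0
    for value in constraints.values():
        prob *= p if value == 1 else 1.0 - p
    return prob


p = 0.3  # P[H], chosen only for illustration
# The two events are disjoint, so their probabilities add.
prob = (cylinder_probability({1: 1, 2: 0}, p)
        + cylinder_probability({1: 0, 2: 1}, p))
print(prob)  # equals 2 p q
```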
We have just seen how to construct the sample space $\boldsymbol{\Omega}_\infty$ for the (infinite-length)
Bernoulli random sequence, where the outcomes $\boldsymbol{\zeta}$ are just infinite-length sequences of "H"
and "T." This $W[n]$ is thus our first nontrivial example of a random sequence. However,
it may seem a bit artificial to regard each random variable $W[n, \boldsymbol{\zeta}]$ as a function of the
infinite dimensional outcome vectors that make up the elements in the sample space $\boldsymbol{\Omega}_\infty$.
It seems as though we have unnecessarily complicated the situation; after all, $W[n, \boldsymbol{\zeta}]$ is just
$W(\zeta_n)$. To see that this notational complication is unavoidable, let us turn to the commonly
occurring model for correlated noise,

$$X[n] = \sum_{m=1}^{n} \alpha^{n-m} W[m], \quad \text{for } n \ge 1, \qquad (8.1\text{-}3)$$

where $W[n]$ is the Bernoulli random sequence just created. Writing the filtered output $X[n]$
for each outcome $\boldsymbol{\zeta}$,

$$X[n, \boldsymbol{\zeta}] = \sum_{m=1}^{n} \alpha^{n-m} W(\zeta_m),$$

we see that each $X[n, \boldsymbol{\zeta}]$ is a function of an ever-increasing (with $n$) number of components
of $\boldsymbol{\zeta}$, that is, the value of $X[n]$ depends on outcomes $\zeta_1, \zeta_2, \ldots, \zeta_n$. If we just dealt with
each fixed value of $n$ as a separate problem, that is, a separate sample space and probability
measure, there would be the unanswered question of consistency. This is where, in practice,
we would call on Kolmogorov's consistency theorem to show that our results are consistent
with one sample space $\boldsymbol{\Omega}_\infty$ which has (infinite-length) outcomes $\boldsymbol{\zeta}$.†


Example 8.1-5
(correlated noise) Consider the random sequence in Equation 8.1-3, with $|\alpha| < 1$. We take
the Bernoulli random sequence $W[n]$ as input, that is, $W[n] = 1$ with probability $p$, and
$W[n] = 0$ with probability $q \triangleq 1 - p$. We want to find the mean of $X[n]$ at each positive $n$.
Since the expectation operator is linear, we can write

$$E\{X[n]\} = E\left\{ \sum_{m=1}^{n} \alpha^{n-m} W[m] \right\} = \sum_{m=1}^{n} \alpha^{n-m} E\{W[m]\}$$

$$= \sum_{m=1}^{n} \alpha^{n-m} p = p \sum_{m=1}^{n} \alpha^{n-m} = p \sum_{m'=0}^{n-1} \alpha^{m'} = p \, \frac{(1 - \alpha^n)}{(1 - \alpha)}.$$


†The use of bold notation for $\boldsymbol{\Omega}_\infty$, $\boldsymbol{\zeta}$, $P_\infty$ is rather extravagant but was introduced to avoid confusion.
Clearly, $\boldsymbol{\Omega}_\infty$ is not the same as $\lim_{n \to \infty} \Omega_n$. Each outcome in $\Omega_n$ is either an $\{H\}$ or a $\{T\}$ no matter how large
$n$ gets. On the other hand, the outcomes $\boldsymbol{\zeta} \in \boldsymbol{\Omega}_\infty$ are infinitely long strings of H's and T's. In the future
we shall dispense with the bold notation even if $\Omega$ is generated by an infinite cross product and its elements
(outcomes) are infinitely long strings.
The random sequence $X[n]$ thus created is not a sequence of independent random
variables, as we can see by calculating the correlation

$$E\{X[2] X[1]\} = E\{(\alpha W[1] + W[2]) W[1]\}$$
$$= \alpha E\{W^2[1]\} + E\{W[2]\} E\{W[1]\}$$
$$= \alpha p + p^2$$
$$\neq (\alpha + 1) p^2 = E\{X[2]\} E\{X[1]\}.$$

The random variables $X[2]$ and $X[1]$ must be dependent, since they are not even
uncorrelated.

However, since the $W[n]$ are uncorrelated we can easily calculate the variance
$\text{Var}\{X[n]\}$ as

$$\text{Var}\{X[n]\} = \sum_{m=1}^{n} \text{Var}\{\alpha^{n-m} W[m]\} = \sum_{m=1}^{n} \alpha^{2(n-m)} \, \text{Var}\{W[m]\} = \frac{(1 - \alpha^{2n})}{(1 - \alpha^2)} \, pq.$$

The dynamics of this random sequence can be modeled using a difference equation. Since
$X[n-1] = \sum_{m=1}^{n-1} \alpha^{n-1-m} W[m]$, it follows that $X[n] = \alpha X[n-1] + W[n]$, a result that
clearly exhibits the dependence of $X[n]$ on its immediate neighbor $X[n-1]$.† Thus, correlated
noise $X[n]$ can be generated from the independent sequence $W[n]$ by filtering with the
configuration shown in Figure 8.1-5. From Equation 8.1-3 we see that for large $n$, $X[n]$ is
the sum of a large number of independent random variables. Hence, by the Central Limit
Theorem, its distribution will tend to be approximately Gaussian as $n \to \infty$ (the approximation
is best when $|\alpha|$ is close to 1, so that many terms contribute comparably), with mean
$\frac{p}{1 - \alpha}$ and variance $\frac{pq}{1 - \alpha^2}$.


Zero-mean, correlated, Gaussian noise can be generated using the same model. Thus,
with $W[1], W[2], \ldots, W[n], \ldots$ denoting zero-mean, independent, identically distributed,
Gaussian random variables with $N(0, \sigma_W^2)$, the random sequence $X[n] = \sum_{m=1}^{n} \alpha^{n-m} W[m]$
will be zero-mean, Gaussian with variance

$$\text{Var}\{X[n]\} = \frac{1 - \alpha^{2n}}{1 - \alpha^2} \, \sigma_W^2,$$

where $\sigma_W^2 \triangleq \text{Var}\{W[n]\}$. Here too, the sequence produced by the filter is correlated since
$E\{X[2] X[1]\} = \alpha E\{W^2[1]\} = \alpha \sigma_W^2 \neq E\{X[2]\} E\{X[1]\} = 0$.

Figure 8.1-5 A feedback filter that generates correlated noise $X[n]$ from an uncorrelated sequence
$W[n]$. (Input $W[n]$, output $X[n]$, feedback gain $\alpha$.)

†Such explicit dependence in the equation like this is sometimes called direct dependence.
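
The closed-form mean and variance of Example 8.1-5 can be verified exactly for small $n$ by enumerating all $2^n$ Bernoulli outcomes. The Python sketch below does this for Equation 8.1-3; the function names and the values $n = 6$, $\alpha = 0.8$, $p = 0.3$ are illustrative assumptions, not taken from the text.

```python
from itertools import product


def exact_moments(n, alpha, p):
    """Mean and variance of X[n] = sum_{m=1}^{n} alpha^(n-m) W[m], obtained by
    enumerating every outcome (w[1], ..., w[n]) of the Bernoulli inputs."""
    mean = 0.0
    second_moment = 0.0
    for w in product((0, 1), repeat=n):
        prob = 1.0
        for bit in w:
            prob *= p if bit else 1.0 - p  # independent trials
        x = sum(alpha ** (n - m) * w[m - 1] for m in range(1, n + 1))
        mean += prob * x
        second_moment += prob * x * x
    return mean, second_moment - mean ** 2


def closed_form_moments(n, alpha, p):
    """Closed-form mean and variance derived in Example 8.1-5."""
    mean = p * (1 - alpha ** n) / (1 - alpha)
    variance = p * (1 - p) * (1 - alpha ** (2 * n)) / (1 - alpha ** 2)
    return mean, variance


print(exact_moments(6, 0.8, 0.3))
print(closed_form_moments(6, 0.8, 0.3))
```

The two printed pairs should agree to within floating-point rounding.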





The next example gives a MATLAB method to construct realizations of the Bernoulli 
random sequence and then passes the resulting sample sequences through a first-order filter 
to generate sample sequences of a (more realistic) correlated random sequence. 


Example 8.1-6
(sample sequence construction) We use MATLAB to construct a sample sequence of $W[n]$.
The MATLAB program

    u = rand(40,1);   % 40 independent samples, uniform on (0,1)
    w = 0.5 >= u;     % w[n] = 1 when u[n] <= 0.5, otherwise 0
    stem(w)           % plot the sample sequence

uses the built-in function "rand" to generate a 40-element vector of uniform random
variables. The second line sets the vector elements $w[n]$ to 1 if $u[n] \le 0.5$, and to 0 if $u[n] > 0.5$.
So $w[n]$ is a sample sequence of the Bernoulli random sequence with $p = 0.5$. The
corresponding MATLAB plot is shown in Figure 8.1-6.

Figure 8.1-6 A sample sequence $w[n]$ for the Bernoulli random sequence $W[n]$. (Horizontal axis: time axis $n$, from 0 to 40.)
Figure 8.1-7 First 40 points illustrating startup transient.


To model the sample sequences of $X[n]$, which we denote $x[n]$, we can filter the sequence
$w[n]$ with the filter,

$$x[n] = \alpha x[n-1] + w[n],$$

which has impulse response $h[n] = \alpha^n u[n]$ (here $u[n]$ denotes the unit-step sequence, not the
uniform vector u of the previous fragment) to realize the linear operation of Equation 8.1-3.
The corresponding MATLAB m-file fragment is

    alpha = 0.95;        % filter coefficient (value used for the figures)
    b = 1.0;             % feedforward coefficient
    a = [1.0 -alpha];    % feedback coefficients: x[n] - alpha*x[n-1] = w[n]
    x = filter(b,a,w);   % apply the first-order recursive filter to w
    stem(x)

The result was computed for $\alpha = 0.95$ and a 400-element vector. Figure 8.1-7 shows
the startup transient for the first 40 values. Figure 8.1-8 shows a sample of the approximate
steady-state behavior starting at $n = 350$ and plotted for 50 points. Note the sample average
value that has built up in $x[n]$ over time.
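
For readers without MATLAB, the same experiment can be sketched in plain Python. This is an illustrative re-implementation of the recursion $x[n] = \alpha x[n-1] + w[n]$ with zero initial condition, not a port of MATLAB's `filter`; the sequence length, seed, and function names are assumptions. With $p = 0.5$ and $\alpha = 0.95$, the mean formula of Example 8.1-5 predicts a steady-state average near $p/(1 - \alpha) = 10$.

```python
import random


def bernoulli_sequence(length, p, rng):
    """Sample sequence w[n]: 1 with probability p, otherwise 0."""
    return [1 if rng.random() <= p else 0 for _ in range(length)]


def first_order_filter(w, alpha):
    """Recursion x[n] = alpha * x[n-1] + w[n] with zero initial condition."""
    x, previous = [], 0.0
    for sample in w:
        previous = alpha * previous + sample
        x.append(previous)
    return x


def direct_sum(w, alpha, n):
    """Equation 8.1-3 evaluated directly; n is 1-based as in the text."""
    return sum(alpha ** (n - m) * w[m - 1] for m in range(1, n + 1))


rng = random.Random(2024)
alpha, p = 0.95, 0.5
w = bernoulli_sequence(20000, p, rng)
x = first_order_filter(w, alpha)

# Discard the startup transient before averaging.
steady = x[200:]
steady_state_average = sum(steady) / len(steady)
print("sample average:", steady_state_average, "predicted:", p / (1 - alpha))
```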





Note that the random sequence $X[n]$ has typical noise-like characteristics. The filter has
correlated the random variables making up $X[n]$ so that sample sequences $x[n]$ look more
"continuous." This simple example is called an autoregressive (AR) model and is widely
used in signal processing to model both noises and signals. Note that the deterministic
defect of the initial examples has now been removed. The reason is that the Bernoulli input
sequence provides a new independent value for every sample, ensuring that the next sample
cannot be perfectly predicted from the past.
Figure 8.1-8 A segment of 50 points starting at n = 350.


Continuity of Probability Measure 


When dealing with an infinite number of events, we have seen that continuity of the
probability measure can be quite useful. Fortunately, the desired continuity is a direct consequence
of the extended Axiom 4 on countable additivity (cf. Equation 8.1-2).


Theorem 8.1-1 Consider an increasing sequence of events $B_n$, that is, $B_n \subset B_{n+1}$
for all $n \geq 1$, as shown in Figure 8.1-9. Define $B_\infty \triangleq \bigcup_{n=1}^{\infty} B_n$; then
$\lim_{n\to\infty} P[B_n] = P[B_\infty]$.


Proof Define the sequence of events $A_n$ as follows:

$$A_1 \triangleq B_1,$$

$$A_n \triangleq B_n B_{n-1}^c, \quad n > 1.$$





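Figure 8.1-9 Illustrating an increasing sequence of events.

The $A_n$ are disjoint and $\bigcup_{n=1}^{N} A_n = \bigcup_{n=1}^{N} B_n$ for all $N$. Also $B_N = \bigcup_{n=1}^{N} B_n$ because the
$B_n$ are increasing. So

$$P[B_N] = P\left[\bigcup_{n=1}^{N} B_n\right] = P\left[\bigcup_{n=1}^{N} A_n\right] = \sum_{n=1}^{N} P[A_n],$$

and

$$\begin{aligned}
\lim_{N\to\infty} P[B_N] &= \lim_{N\to\infty} \sum_{n=1}^{N} P[A_n] \\
&= \sum_{n=1}^{\infty} P[A_n] \quad \text{by definition of the limit of a sum,} \\
&= P\left[\bigcup_{n=1}^{\infty} A_n\right] \quad \text{by Axiom 4,} \\
&= P[B_\infty] \quad \text{by definition of the } A_n.
\end{aligned}$$

This last step results from $\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} B_n \triangleq B_\infty$. ∎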


Corollary 8.1-1 Let $B_n$ be a decreasing sequence of events, that is, $B_n \supset B_{n+1}$ for
all $n \geq 1$. Then

$$\lim_{n\to\infty} P[B_n] = P[B_\infty],$$

where

$$B_\infty \triangleq \bigcap_{n=1}^{\infty} B_n.$$

Proof Similar to the proof of Theorem 8.1-1 and left to the student. ∎


Example 8.1-7 


Let $B_n \triangleq \{X[k] < 2 \text{ for } 0 \leq k \leq n\}$, for $n = 0, 1, 2, \ldots$. In words, $B_n$ is the event that
$X[k]$ is less than 2 for the indicated range of $k$. Clearly $B_{n+1}$ is a subset of $B_n$, that is,
$B_{n+1} \subset B_n$ for all $n = 0, 1, 2, \ldots$. Also if we set $B_\infty \triangleq \{X[k] < 2 \text{ for all } k \geq 0\}$, then
$B_\infty = \bigcap_{n=0}^{\infty} B_n$. So we can write, by the above corollary,

$$\begin{aligned}
P[B_\infty] &= \lim_{n\to\infty} P[B_n] \\
&= \lim_{n\to\infty} P[X[0] < 2, \ldots, X[n] < 2].
\end{aligned}$$

Thus, the corollary provides a way of calculating the probabilities of events involving an infinite number of
random variables by just taking the limit of the probability involving a finite number of
random variables. This type of limiting calculation is often performed in engineering
analyses, and typically without explicit justification (i.e., without worrying about the consistency
problem mentioned earlier). In this section we have seen that the correctness of the approach
rests on a fundamental axiom of probability theory, Axiom 4 (countable additivity).
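
As a small worked illustration of this limit (an assumption added here, not a case treated in the example itself), suppose the $X[k]$ happen to be i.i.d. and write $q \triangleq P[X[k] < 2]$. Then the finite-order probabilities factor, and the limit can be taken directly:

$$P[B_n] = \prod_{k=0}^{n} P[X[k] < 2] = q^{\,n+1}, \qquad P[B_\infty] = \lim_{n\to\infty} q^{\,n+1} = \begin{cases} 0, & 0 \leq q < 1, \\ 1, & q = 1. \end{cases}$$

So, under this i.i.d. assumption, the sequence stays below 2 forever with probability zero unless each sample does so with probability one.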





We next use the continuity of the probability measure P to prove an elementary fact 
about CDFs. 


Example 8.1-8
(continuity on the right) The CDF is continuous from the right; that is, for $F_X(x) = P[X(\zeta) \leq x]$
[cf. Property (iii) of $F_X$ in Section 2.3], we have

$$\lim_{n\to\infty} F_X\left(x + \frac{1}{n}\right) = F_X(x).$$

To show this, we define

$$B_n \triangleq \left\{\zeta : X(\zeta) \leq x + \frac{1}{n}\right\}$$

and note that $B_n$ is a decreasing sequence of events, where $B_\infty \triangleq \bigcap_{n=1}^{\infty} B_n = \{\zeta : X(\zeta) \leq x\}$
and

$$F_X\left(x + \frac{1}{n}\right) = P[B_n].$$

By application of Corollary 8.1-1, we get

$$\lim_{n\to\infty} F_X\left(x + \frac{1}{n}\right) = \lim_{n\to\infty} P[B_n] = P[B_\infty] = F_X(x).$$


Statistical Specification of a Random Sequence


A random sequence $X[n]$ is said to be statistically specified by knowing its Nth-order CDFs
for all integers $N \geq 1$, and for all times $n, n+1, \ldots, n+N-1$, that is, if we know

$$\begin{aligned}
&F_X(x_n, x_{n+1}, \ldots, x_{n+N-1};\, n, n+1, \ldots, n+N-1) \\
&\qquad \triangleq P[X[n] \leq x_n, X[n+1] \leq x_{n+1}, \ldots, X[n+N-1] \leq x_{n+N-1}],
\end{aligned} \tag{8.1-4}$$

where the variables after the semicolon, $n, n+1, \ldots, n+N-1$, indicate the location of the
$N$ random variables in this joint CDF. Note that this is an infinite set of CDFs for each
order $N$, because we must know the joint CDF at all times $n$, $-\infty < n < +\infty$. Incurring
some penalty in notational clarity, we often write the joint CDFs more simply as

$$F_X(x_n, x_{n+1}, \ldots, x_{n+N-1}), \quad \text{for all } n \text{ and for all } N \geq 1. \tag{8.1-5}$$

We also define Nth-order CDFs for nonconsecutive time parameters,

$$F_X(x_{n_1}, x_{n_2}, \ldots, x_{n_N};\, n_1, n_2, \ldots, n_N).$$


It may seem that this statistical specification is some distance from a complete
description of the entire random sequence since no one distribution function in this infinite set
of finite-order CDFs describes the entire random sequence. Nevertheless, if we specify all
these finite-order joint distributions at all finite times, using continuity of the probability
measure that we have just shown, we can calculate the probabilities of events involving
infinite numbers of random variables via limiting operations involving the finite-order CDFs.
Of course, we do have to make sure that our set of Nth-order CDFs is consistent within
itself! Sometimes consistency is trivial to verify, for instance, when all the random variables that
make up the random sequence are independent of one another, as in a Bernoulli
random sequence.


Example 8.1-9 
(consistency) For consistency, the low-order CDFs must agree with the higher-order CDFs.
For example, considering just N = 2 and 3, we must have

$$F_X(x_n, x_{n+2};\, n, n+2) = F_X(x_n, \infty, x_{n+2};\, n, n+1, n+2),$$

for all $n$, and for all values of $x_n$ and $x_{n+2}$. Likewise, the N = 1 CDFs must be consistent
with those of N = 2. Further, the consistency must extend to all higher orders N.
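
For a discrete-valued sequence the same requirement can be phrased with PMFs: summing a third-order PMF over the middle sample must give the second-order PMF of the two outer samples. The Python sketch below checks this for an independent Bernoulli sequence, where consistency holds by construction. The helper names and the value p = 0.3 are assumptions chosen only for illustration.

```python
from itertools import product

def bernoulli_joint_pmf(values, p=0.5):
    """Joint PMF of independent Bernoulli(p) samples taking the given 0/1 values."""
    prob = 1.0
    for v in values:
        prob *= p if v == 1 else 1.0 - p
    return prob

def marginalize_middle(p=0.5):
    """Sum the third-order PMF over the middle sample for every (first, last) pair."""
    return {
        (a, c): sum(bernoulli_joint_pmf((a, b, c), p) for b in (0, 1))
        for a, c in product((0, 1), repeat=2)
    }

# Second-order PMF at times (n, n+2), computed directly and via the third-order PMF.
second_order = {(a, c): bernoulli_joint_pmf((a, c), 0.3)
                for a, c in product((0, 1), repeat=2)}
from_third_order = marginalize_middle(0.3)
print(second_order)
print(from_third_order)
```

The two dictionaries agree up to floating-point rounding, which is the PMF counterpart of setting the middle CDF argument to infinity.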


Consistency can be guaranteed by construction, as in the case of the filtered Bernoulli 
random sequence of Example 8.1-6 above. If we were faced with a suspect set of Nth-order 
CDFs of unknown origin, it would be a daunting task, indeed, to show that they were 
consistent. Hence, we see the important role played by constructive models in stochastic 
sequences and processes. 

In summary, we have seen two ways to specify a random sequence: the statistical
characterization (Equation 8.1-4) and the direct specification in terms of the random functions
$X[n, \zeta]$. We use the word statistical to indicate that the former information can be obtained,
at least conceptually, by estimating the Nth-order CDFs for N = 1, 2, 3, and so forth,
that is, by using statistics.

The Nth-order probability density functions (pdf’s) are given for differentiable $F_X$ as

$$f_X(x_n, x_{n+1}, \ldots, x_{n+N-1};\, n, n+1, \ldots, n+N-1) = \frac{\partial^N F_X(x_n, x_{n+1}, \ldots, x_{n+N-1};\, n, n+1, \ldots, n+N-1)}{\partial x_n\, \partial x_{n+1} \cdots \partial x_{n+N-1}}, \tag{8.1-6}$$

for every integer (time) $n$ and positive integer (order) $N$. Sometimes we will omit the
subscript $X$ when only one random sequence is under consideration. Also, we may drop the
explicit time notation and write

$$f_X(x_n, x_{n+1}, \ldots, x_{n+N-1}) \quad \text{for} \quad f_X(x_n, x_{n+1}, \ldots, x_{n+N-1};\, n, n+1, \ldots, n+N-1).$$

We will sometimes want to deal with complex random variables and sequences. By this
we mean an ordered pair of real random variables, that is, $X = (X_R, X_I)$, often written as
$X = X_R + jX_I$, with CDF

$$F_X(x_R, x_I) \triangleq P[X_R \leq x_R, X_I \leq x_I].$$

The corresponding pdf is then

$$f_X(x_R, x_I) = \frac{\partial^2 F_X(x_R, x_I)}{\partial x_R\, \partial x_I}.$$

To simplify notation we will write $f_X(x)$ for $f_X(x_R, x_I)$ in what follows, with the
understanding that the respective integrals (sums for the discrete-valued complex case) are really
double integrals on the $(x_R, x_I)$ plane if the random variable is complex.†

The moments of a random sequence play an important role in most applications. In
part this is because for a large class of random sequences (so-called ergodic sequences, to be
covered in Section 10.4 in Chapter 10), they can be easy to estimate from just one sample
sequence. The first moment or mean function of a random sequence is

$$\begin{aligned}
\mu_X[n] \triangleq E\{X[n]\} &= \int_{-\infty}^{+\infty} x f_X(x; n)\, dx \\
&= \int_{-\infty}^{+\infty} x_n f_X(x_n)\, dx_n
\end{aligned}$$

for a continuous-valued random sequence $X[n]$. The mean function for a discrete-valued
random sequence, taking on values from the set $\{x_k, -\infty < k < +\infty\}$ at time $n$, is
evaluated as

$$\mu_X[n] = E\{X[n]\} = \sum_{k=-\infty}^{+\infty} x_k P[X[n] = x_k]. \tag{8.1-7}$$

In the case of a mixed random sequence, as in the case of mixed random variables, it is
convenient to write

$$\mu_X[n] = \int_{-\infty}^{+\infty} x f_X(x; n)\, dx + \sum_{k=-\infty}^{+\infty} x_k P[X[n] = x_k]. \tag{8.1-8}$$

Actually, using the concept of the Stieltjes integral [8-3], both terms can be rewritten in the
one form

$$\mu_X[n] = \int_{-\infty}^{+\infty} x\, dF_X(x; n),$$

in terms of the CDF $F_X(x; n)$.

The expected value of the product of the random sequence evaluated at two times,
$X[k]X^*[l]$, is called the autocorrelation function and is a two-parameter function of both
times $k$ and $l$, where $-\infty < k, l < +\infty$,

$$\begin{aligned}
R_{XX}[k, l] &\triangleq E\{X[k]X^*[l]\} \\
&= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} x_k x_l^* f_X(x_k, x_l;\, k, l)\, dx_k\, dx_l,
\end{aligned} \tag{8.1-9}$$

when the autocorrelation function exists (the usual case, but of course, in some cases the
integral might not converge). Most of the time we will deal with second-order random
sequences, defined by their property of having finite average power $E\{|X[n]|^2\} < \infty$. Then
the corresponding correlation function will always exist. Later we shall see that the
conjugate on the second factor in the autocorrelation function definition results in some
notational simplicities for complex-valued random sequences. We will also define the centered
random sequence $X_c[n] \triangleq X[n] - \mu_X[n]$, which is zero-mean, and consider its autocorrelation
function, called the autocovariance function of the original sequence $X[n]$. It is defined as

$$K_{XX}[k, l] \triangleq E\{(X[k] - \mu_X[k])(X[l] - \mu_X[l])^*\}. \tag{8.1-10}$$

Directly from these definitions, we note the following symmetry conditions must hold:

$$R_{XX}[k, l] = R_{XX}^*[l, k], \tag{8.1-11}$$

$$K_{XX}[k, l] = K_{XX}^*[l, k], \tag{8.1-12}$$

called Hermitian symmetry. Also note that

$$K_{XX}[k, l] = R_{XX}[k, l] - \mu_X[k]\mu_X^*[l]. \tag{8.1-13}$$

The variance function is defined as $\sigma_X^2[n] \triangleq K_{XX}[n, n]$ and denotes the average power
in $X_c[n]$. The power of $X[n]$ itself has been given above and equals $R_{XX}[n, n]$.

†Complex random sequences are used as equivalent baseband models of certain bandpass signals and
noises. The resulting complex-valued simulation can then be run at a much lower sample rate.
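
The relations in Equations 8.1-11 and 8.1-13 can be checked numerically on a toy case. The Python sketch below assumes a jointly discrete, complex-valued pair $(X[k], X[l])$ with three equally indexed outcomes. The sample values, probabilities, and the function name `moments` are all made up for illustration.

```python
def moments(samples_k, samples_l, probs):
    """Means, correlation and covariance of a jointly discrete pair (X[k], X[l]).

    Outcome i gives X[k] = samples_k[i] and X[l] = samples_l[i] with probability probs[i].
    """
    mu_k = sum(p * xk for xk, p in zip(samples_k, probs))
    mu_l = sum(p * xl for xl, p in zip(samples_l, probs))
    # Conjugate on the second factor, as in Equation 8.1-9.
    r_kl = sum(p * xk * xl.conjugate()
               for xk, xl, p in zip(samples_k, samples_l, probs))
    k_kl = sum(p * (xk - mu_k) * (xl - mu_l).conjugate()
               for xk, xl, p in zip(samples_k, samples_l, probs))
    return mu_k, mu_l, r_kl, k_kl

# Hypothetical outcomes and probabilities.
xk = [complex(1, 1), complex(0, -1), complex(2, 0)]
xl = [complex(0, 1), complex(1, 0), complex(-1, 2)]
probs = [0.5, 0.25, 0.25]

mu_k, mu_l, r_kl, k_kl = moments(xk, xl, probs)
_, _, r_lk, k_lk = moments(xl, xk, probs)
print(r_kl, r_lk.conjugate())               # Hermitian symmetry of R
print(k_kl, r_kl - mu_k * mu_l.conjugate())  # K = R - mu[k] mu*[l]
```

Each printed pair should match up to rounding, since both identities follow directly from the definitions.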


Example 8.1-10
(Example 8.1-1 cont’d.) The mean function of $X[n]$ as given in Example 8.1-1 is

$$\mu_X[n] = E\{X[n]\} = E\{X f[n]\} = \mu_X f[n],$$

where $\mu_X$ is the mean of the random variable $X$. The autocorrelation function is

$$\begin{aligned}
R_{XX}[k, l] &= E\{X[k]X^*[l]\} = E\{X f[k]\, X^* f^*[l]\} \\
&= E\{|X|^2\} f[k] f^*[l],
\end{aligned}$$

and so the autocovariance function is given as

$$\begin{aligned}
K_{XX}[k, l] &= E\{|X|^2\} f[k] f^*[l] - |\mu_X|^2 f[k] f^*[l] \\
&= \left(E\{|X|^2\} - |\mu_X|^2\right) f[k] f^*[l] \\
&= E\{|X - \mu_X|^2\} f[k] f^*[l] \\
&= \sigma_X^2 f[k] f^*[l],
\end{aligned}$$

where $\sigma_X^2 = \mathrm{Var}(X)$. We thus see that the variance $\sigma_X^2[n]$ is just $\sigma_X^2 |f[n]|^2$.

Figure 8.1-10 The T[n] are arrival times and the τ[n] are interarrival times.


We look at a sequence which fits our notions of randomness better in the next example. 


Example 8.1-11
(waiting times) Consider the random sequence consisting of i.i.d. random variables $\tau[n]$ for
$n \geq 1$, each with the exponential pdf of Equation 2.4-16, that is,†

$$f_\tau(t; n) = f_\tau(t) = \lambda \exp(-\lambda t)u(t), \quad n = 1, 2, \ldots$$

Write the running sum of the $\tau[k]$ up to time $n$, defined as

$$T[n] \triangleq \sum_{k=1}^{n} \tau[k], \tag{8.1-14}$$

and consider $T[n]$ as a second random sequence for $n = 1, 2, \ldots$. It turns out that the arrival
of random events in time is often modeled in this way. We say that $T[n]$ is the time to the
nth arrival or waiting time and we call the $\tau[n]$ the interarrival times.‡ See Figure 8.1-10.

Later, in Chapter 9, we shall see that the important Poisson random process can be
constructed in this way. Here we want to determine the pdf of $T[n]$ at each $n$ based on the
definition in Equation 8.1-14. Using the fact that the $\tau[k]$ are independent, we can apply
Equation 4.7-3 and conclude that the pdf of $T[n]$ will be the (n-1)-fold convolution product
of exponential pdf’s. Using convolution to determine the pdf of $T[2]$, we get

$$f_T(t; 2) = f_\tau(t) * f_\tau(t) = \lambda^2 t \exp(-\lambda t)u(t).$$

†Recall that $\lambda = 1/\mu$.
‡Please regard $\tau$ as a “capital tau” to continue our distinction between a random variable and the value
it takes on, that is, X = x.
Convolving this result with the exponential pdf a second time, we get

$$f_T(t; 3) = \tfrac{1}{2}\lambda^3 t^2 \exp(-\lambda t)u(t).$$

It turns out that the general form is the Erlang pdf,

$$f_T(t; n) = \frac{(\lambda t)^{n-1}}{(n-1)!}\,\lambda \exp(-\lambda t)u(t). \tag{8.1-15}$$

The Erlang or gamma pdf [8-4] is widely used in waiting-time problems in telecommunications
networks and is plotted via MATLAB in Figure 8.1-11 for n = 3 and λ = 1.0, which is
the pdf of the waiting time to the third arrival.

Figure 8.1-11 A plot of the Erlang pdf for λ = 1 and n = 3.


We can establish this density’s correctness by the Principle of Mathematical Induction.
(See Section A.4 in Appendix A.) It is composed of two steps: (1) First show the formula is
correct at n = 1; (2) then show that if the formula is true at n - 1, it must also be true at n.
Combining these two steps, we have effectively proved the result for all positive integers n.

We see that $f_T(t; 1)$ in Equation 8.1-15 is correct, so we proceed by assuming
Equation 8.1-15 is true at n - 1. By convolving with the exponential, we can show that
it is true at n as follows:

$$\begin{aligned}
f_T(t; n) &= f_T(t; n-1) * \lambda \exp(-\lambda t)u(t) \\
&= \left[\int_0^t \frac{(\lambda \tau)^{n-2}}{(n-2)!}\,\lambda \exp(-\lambda \tau)\,\lambda \exp(-\lambda(t - \tau))\, d\tau\right] u(t) \\
&= \lambda^n \exp(-\lambda t)\left[\int_0^t \frac{\tau^{n-2}}{(n-2)!}\, d\tau\right] u(t) \\
&= \lambda^n \exp(-\lambda t)\,\frac{t^{n-1}}{(n-1)!}\, u(t).
\end{aligned}$$

Using the i.i.d. property of the $\tau[n]$, we can also compute the mean as

$$\mu_T[n] = n\mu_\tau = n(1/\lambda) = n/\lambda$$

and the variance of the sum $T[n]$ by repeated use of property (A) of Equation 4.3-18:

$$\mathrm{Var}[T[n]] = n\,\mathrm{Var}[\tau] = n/\lambda^2.$$
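
As a rough numerical cross-check of Equation 8.1-15 and of the mean and variance just derived, one can integrate the Erlang pdf on a fine grid. The Python sketch below uses a midpoint rule. The truncation point `t_max`, the grid size, and the function names are assumptions made here for illustration.

```python
import math

def erlang_pdf(t, n, lam):
    """Erlang pdf of Equation 8.1-15 (zero for t < 0)."""
    if t < 0:
        return 0.0
    return (lam * t) ** (n - 1) / math.factorial(n - 1) * lam * math.exp(-lam * t)

def riemann_moments(n, lam, t_max=60.0, steps=60000):
    """Midpoint-rule estimates of area, mean and variance on [0, t_max]."""
    dt = t_max / steps
    area = mean = second = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        f = erlang_pdf(t, n, lam) * dt
        area += f
        mean += t * f
        second += t * t * f
    return area, mean, second - mean ** 2

# For n = 3 and lam = 1 the estimates should be close to 1, n/lam = 3 and n/lam^2 = 3.
area, mean, var = riemann_moments(n=3, lam=1.0)
print(area, mean, var)
```

The truncation at `t_max` is harmless for these parameters because the exponential tail beyond it is negligible.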





We next introduce the most widely used random model in electrical engineering,
communications, and control: the Gaussian (Normal) random sequence. Its wide popularity stems
from two important facts: (1) the Central Limit theorem (Theorem 4.7-2) assures that many
processes occurring in practice are approximately Gaussian; and (2) the mathematics is
especially tractable in problems involving detection, estimation, filtering, and control theory.


Definition 8.1-3 A random sequence $X[n]$ is called a Gaussian random sequence if
its Nth-order CDFs (pdf’s) are jointly Gaussian, for all $N \geq 1$. ∎


We note that the mean and covariance function will specify a Gaussian random sequence 
in the same way that the mean vector and covariance matrix determine a Gaussian random 
vector (see Section 5.5). This is because each Nth-order distribution function is just the 
CDF of a Gaussian random vector whose mean vector and covariance matrix are expressible 
in terms of the mean and covariance functions of the Gaussian random sequence. 





Example 8.1-12
(pairwise average) Let $W[n]$ be a real-valued Gaussian i.i.d. sequence with mean $\mu_W[n] = 0$
for all $n$ and autocorrelation function $R_{WW}[k, l] = \sigma^2\delta[k - l]$, $\sigma > 0$, where $\delta$ is the
discrete-time impulse

$$\delta[n] \triangleq \begin{cases} 1, & n = 0, \\ 0, & n \neq 0. \end{cases}$$

If we form a covariance matrix, then, for a vector of any N distinct samples, it will be
diagonal. So, by Gaussianity, each Nth-order pdf will factor into a product of N first-order
pdf’s. Hence the elements of this random sequence are jointly independent, or what we
call an independent (Gaussian) random sequence (cf. Definition 8.1-2). Next we create the
random sequence $X[n]$ by taking the sum of the current and previous $W[n]$ values,

$$X[n] \triangleq W[n] + W[n-1], \quad \text{for } -\infty < n < +\infty.$$

Here $X[n]$ is also Gaussian in all its Nth-order distributions (since a linear transformation
of a Gaussian random vector produces a Gaussian vector by Theorem 5.6-1); hence $X[n]$ is
also a Gaussian random sequence. We can easily evaluate the mean of $X[n]$ as

$$\mu_X[n] = E\{X[n]\} = E\{W[n]\} + E\{W[n-1]\} = 0,$$

and its correlation function as

$$\begin{aligned}
R_{XX}[k, l] &= E\{X[k]X[l]\} \\
&= E\{(W[k] + W[k-1])(W[l] + W[l-1])\} \\
&= E\{W[k]W[l]\} + E\{W[k]W[l-1]\} + E\{W[k-1]W[l]\} + E\{W[k-1]W[l-1]\} \\
&= R_{WW}[k, l] + R_{WW}[k, l-1] + R_{WW}[k-1, l] + R_{WW}[k-1, l-1] \\
&= \sigma^2\left(\delta[k-l] + \delta[k-l+1] + \delta[k-l-1] + \delta[k-l]\right).
\end{aligned}$$

We can plot this autocorrelation in the $(k, l)$ plane as shown in Figure 8.1-12 and see
the time extent of the dependence of the random sequence $X[n]$.

Figure 8.1-12 Diagram of the tri-diagonal correlation function of Example 8.1-12.

From this figure, we see that the autocorrelation has value $2\sigma^2$ on the diagonal line $l = k$
and has value $\sigma^2$ on the diagonal lines $l = k \pm 1$. It should be clear from Figure 8.1-12 that
$X[n]$ is not an independent random sequence. However, the banded support of this covariance
function signifies that dependence is limited to shifts $(k - l) = \pm 1$ in time. Beyond this lag
we have uncorrelated, and hence in this Gaussian case, independent random variables.
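
The banded structure can be reproduced with a few lines of Python. This sketch assumes a finite window of samples $X[1], \ldots, X[N]$ built from $W[0], \ldots, W[N]$ and computes $R_{XX} = \sigma^2 A A^T$, where row $n$ of the 0/1 matrix $A$ selects $W[n]$ and $W[n-1]$. The function name and window size are illustrative choices, not part of the example.

```python
def pairwise_sum_correlation(num_samples, sigma2=1.0):
    """R_XX[k][l] for X[n] = W[n] + W[n-1], n = 1..num_samples, with R_WW = sigma2*delta."""
    cols = num_samples + 1  # columns index W[0], ..., W[num_samples]
    # Row for X[n] has ones in the columns of W[n] and W[n-1].
    A = [[1 if j in (n, n - 1) else 0 for j in range(cols)]
         for n in range(1, num_samples + 1)]
    return [[sigma2 * sum(A[k][j] * A[l][j] for j in range(cols))
             for l in range(num_samples)]
            for k in range(num_samples)]

R = pairwise_sum_correlation(5, sigma2=2.0)
for row in R:
    print(row)
```

With `sigma2=2.0` the main diagonal holds 4.0, the two adjacent diagonals hold 2.0, and every other entry is 0.0, matching the tri-diagonal pattern described above.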


Example 8.1-13 
(random walk sequence) Continuing with infinite-length Bernoulli trials, we now define a
random sequence $X[n]$ as the running sum of the number of successes (heads) minus the
number of failures (tails) in $n$ trials times a step size $s$,

$$X[n] \triangleq \sum_{k=1}^{n} W[k] \quad \text{with } X[0] = 0,$$

where we redefine $W[k] = +s$ for outcome $\zeta = \mathrm{H}$ and $W[k] = -s$ for outcome $\zeta = \mathrm{T}$.

Figure 8.1-13 A sample sequence x[n] for random walk X[n] with step size s = 0.2. (Vertical axis: sample sequence x[n]; horizontal axis: time index n, from 0 to 60.)

The resulting sequence then models a random walk on the integers starting at position
$X[0] = 0$. At each succeeding time unit a step of size $s$ is taken either to the right or to
the left. After $n$ steps we will be at a position $rs$ for some integer $r$. This is illustrated in
Figure 8.1-13.

If there are $k$ successes and necessarily $(n - k)$ failures, then we have the following
relation:

$$rs = ks - (n - k)s = (2k - n)s,$$

which implies that $k = (n + r)/2$, for those values of $r$ that make the right-hand side an
integer. Then with $P[\text{success}] = P[\text{failure}] = \tfrac{1}{2}$, we have

$$P[X[n] = rs] = P[(n + r)/2 \text{ successes}] = \begin{cases} \dbinom{n}{(n+r)/2}\, 2^{-n}, & (n + r)/2 \text{ an integer},\ |r| \leq n, \\ 0, & \text{else.} \end{cases}$$

Using the fact that $X[n] = W[1] + W[2] + \ldots + W[n]$ and that the $W$’s are jointly
independent, we can compute the mean and variance of the random walk as follows:

$$E\{X[n]\} = \sum_{k=1}^{n} E\{W[k]\} = \sum_{k=1}^{n} 0 = 0,$$

and

$$E\{X^2[n]\} = \sum_{k=1}^{n} E\{W^2[k]\} = n s^2.$$

If we normalize $X[n]$ by dividing by $\sqrt{n}$ and define

$$\tilde{X}[n] \triangleq \frac{1}{\sqrt{n}} X[n],$$

then by the Central Limit Theorem 4.7-2 we have that the CDF of $\tilde{X}[n]$ converges to the
Gaussian (Normal) distribution $N(0, s^2)$. Thus for $n$ large enough, we can approximate the
probabilities

$$P[a < \tilde{X}[n] \leq b] = P[a\sqrt{n} < X[n] \leq b\sqrt{n}] \simeq \mathrm{erf}(b/s) - \mathrm{erf}(a/s).$$

Note, however, that when this probability is small, very large values of $n$ might be required
to keep the percentage error small because small errors in the CDF may be comparable to
the required probability value. In practice this means that the Normal approximation will
not be dependable on the tails of the distribution but only in the central part, hence the
name Central Limit Theorem.

Note also that while $X[n]$ can never be considered approximately Gaussian for any $n$
(e.g., if $n$ is even, $X[n]$ can only be an even multiple of $s$), still we can approximately
calculate the probability

$$\begin{aligned}
P[(r - 2)s < X[n] \leq rs] &= P\left[\frac{r - 2}{\sqrt{n}} < \frac{\tilde{X}[n]}{s} \leq \frac{r}{\sqrt{n}}\right] \\
&\simeq \frac{1}{\sqrt{2\pi}} \int_{(r-2)/\sqrt{n}}^{r/\sqrt{n}} \exp(-0.5v^2)\, dv \\
&\simeq \frac{1}{\sqrt{\pi(n/2)}} \exp(-r^2/2n),
\end{aligned}$$

where $r$ is small with respect to $\sqrt{n}$. See Section 1.11 for a similar result. In obtaining
the last line, we assumed that the integrand was approximately constant over the interval
$[(r - 2)/\sqrt{n},\ r/\sqrt{n}]$.
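
To get a feel for how close the last approximation is, the exact binomial probability $P[X[n] = rs]$ can be compared with $\exp(-r^2/2n)/\sqrt{\pi n/2}$. The Python sketch below does this for n = 100 and a few values of r chosen here as examples. The function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math

def random_walk_pmf(n, r):
    """Exact P[X[n] = r*s] for the symmetric random walk (zero when unreachable)."""
    if abs(r) > n or (n + r) % 2 != 0:
        return 0.0
    return math.comb(n, (n + r) // 2) / 2 ** n

def gaussian_approximation(n, r):
    """Approximation exp(-r^2 / (2n)) / sqrt(pi * n / 2) from the text."""
    return math.exp(-r * r / (2 * n)) / math.sqrt(math.pi * n / 2)

n = 100
for r in (0, 2, 10):
    print(r, random_walk_pmf(n, r), gaussian_approximation(n, r))
```

Because only values of $r$ with the same parity as $n$ are reachable, the interval $((r-2)s, rs]$ contains exactly one attainable position, so the exact interval probability equals the PMF value at $rs$. The agreement degrades as $|r|$ grows relative to $\sqrt{n}$, in line with the caution about the tails.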


The waiting-time sequence in Example 8.1-11 and the random walk in Example 8.1-13 
both have the property that they build up over time from independent components or 
increments. More generally we can define an independent-increments property. 
Definition 8.1-4 A random sequence is said to have independent increments if for
all integer parameters $n_1 < n_2 < \ldots < n_N$, the increments $X[n_1], X[n_2] - X[n_1], X[n_3] - X[n_2], \ldots, X[n_N] - X[n_{N-1}]$
are jointly independent for all integers $N > 1$. ∎


If a random sequence has independent increments, one can build up its Nth-order 
probabilities (PMFs and pdf’s) as products of the probabilities of its increments. (See 
Problem 8.10.) 

In contrast to the evolving nature of independent increments, many random sequences 
have constant statistical properties that are invariant with respect to the index parameter n, 
normally time or distance. When this is valid, the random model is simplified in two ways: 
First, it is time-invariant, and second, the usually small number of model parameters can 
be estimated from available data. 


Definition 8.1-5 If for all orders $N$ and for all shift parameters $k$, the joint CDFs of
$(X[n], X[n+1], \ldots, X[n+N-1])$ and $(X[n+k], X[n+k+1], \ldots, X[n+k+N-1])$ are
the same functions, then the random sequence is said to be stationary, i.e., for all $N \geq 1$,

$$\begin{aligned}
&F_X(x_n, x_{n+1}, \ldots, x_{n+N-1};\, n, n+1, \ldots, n+N-1) \\
&\qquad = F_X(x_n, x_{n+1}, \ldots, x_{n+N-1};\, n+k, n+1+k, \ldots, n+N-1+k)
\end{aligned} \tag{8.1-16}$$

for all $-\infty < k < +\infty$ and for all $x_n$ through $x_{n+N-1}$. This definition also holds for pdf’s
when they exist and PMFs in the discrete amplitude case. ∎


If we look back at Example 8.1-12, we see that $X[n]$ and $W[n]$ are both stationary
random sequences. The same was true of the interarrival times $\tau[n]$ in Example 8.1-11, but
the random arrival or waiting time sequence $T[n]$ was clearly nonstationary, since its mean
and variance increase with time $n$.

Note that stationarity does not mean that the sample sequences all look “similar,” or
even that they all look “noisy.”† Also, unlike the concept of stationarity in mathematics and
physics, we don’t directly characterize the realizations of the random sequence as stationary,
just the deterministic functions that characterize their behavior, i.e., CDF, PMF, and pdf.

It is often desirable to partially characterize a random sequence based on knowledge 
of only its first two moments, that is, its mean function and covariance function. This 
has already been encountered for random vectors in Chapter 5. We will encounter this for 
random sequences when we present a discussion of linear estimation in the signal-processing 
applications of Chapter 11. In anticipation we define a weakened kind of stationarity that 
involves only the mean and covariance (or correlation) functions. Specifically, if these two 
functions are consistent with stationarity, then we say that the random sequence is wide- 
sense stationary (WSS). 


†For example, suppose we do the Bernoulli experiment of flipping a fair coin once and generate a random
sequence as follows: If the outcome is heads then X[n] = 1 for all n. If the outcome is tails then X[n] = W[n],
that is, stationary white noise again for all n. Thus, the sample sequences look quite dissimilar, but the
random sequence is easily seen to be stationary. In Chapter 10, we discuss the property of ergodicity, which,
loosely speaking, enables expectations (ensemble averages) to be computed from time averages. In this case
the sample functions would tend to have the same features; that is, a viewer would subjectively feel that
they come from the same source.
Definition 8.1-6 A random sequence $X[n]$ defined for $-\infty < n < +\infty$ is called
wide-sense stationary (WSS) if

(1) The mean function of $X[n]$ is constant for all integers $n$, $-\infty < n < +\infty$,

$$\mu_X[n] = \mu_X[0], \quad \text{and}$$

(2) For all times $k, l$, $-\infty < k, l < +\infty$, and integers $n$, $-\infty < n < +\infty$, the covariance
(correlation) function is independent of the shift $n$,

$$K_{XX}[k, l] = K_{XX}[k+n, l+n]. \quad ∎ \tag{8.1-17}$$

We will call such a covariance (correlation) function shift-invariant. If we think of $[k, l]$
as a constellation or set of two samples on the time line, then we are translating this
constellation up and down the time line, and saying that the covariance function does not
change. When the mean function is constant, then shift invariance of the covariance and
correlation functions is equivalent. Otherwise it is not. For a constant mean function, we
can check property (2) for either the covariance or correlation function.

While all stationary sequences are WSS, the reverse is not true. For example, the third 
moment could be shift-variant in a manner not consistent with stationarity even though 
the first moment is constant and the second moment is shift-invariant. Then the random 
sequence would be WSS but not stationary. To further distinguish them, sometimes we refer 
to stationarity as strict-sense stationarity to avoid confusion with the weaker concept of 
wide-sense stationarity. 


Theorem 8.1-2 All stationary random sequences are WSS.


Proof We first show that the mean is constant for a stationary random sequence.
Let $n$ be arbitrary:

$$\mu_X[n] = E\{X[n]\} = \int_{-\infty}^{+\infty} x f_X(x; n)\, dx = \int_{-\infty}^{+\infty} x f_X(x; 0)\, dx = \mu_X[0],$$

since $f_X(x; n)$ does not depend on $n$. Next we show that the covariance function is
shift-invariant by first showing that the correlation is shift-invariant:

$$\begin{aligned}
R_{XX}[k, l] &= E\{X[k]X^*[l]\} \\
&= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} x_k x_l^* f_X(x_k, x_l)\, dx_k\, dx_l \\
&= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} x_{n+k} x_{n+l}^* f_X(x_{n+k}, x_{n+l})\, dx_{n+k}\, dx_{n+l},^{\dagger} \\
&= R_{XX}[n+k, n+l],
\end{aligned}$$

since $f_X(x_k, x_l)$ doesn’t depend on the shift $n$, and the $x_i$’s are dummy variables. Finally,
we use Equation 8.1-13 and the result on the mean functions to conclude that the covariance
function is also shift-invariant. Since the covariance function is shift-invariant for any WSS
random sequence, we can define a one-parameter covariance function to simplify the notation
for WSS sequences

$$\begin{aligned}
K_{XX}[m] &\triangleq E\{X_c[k+m]X_c^*[k]\} = K_{XX}[k+m, k] \\
&= K_{XX}[m, 0].
\end{aligned} \tag{8.1-18}$$

We also do the same for correlation functions. Writing the one-parameter correlation
function in terms of the corresponding two-parameter correlation function, we have

$$R_{XX}[m] = R_{XX}[k+m, k] = R_{XX}[m, 0]. \quad ∎$$

†These middle two lines use our simplified notation. They are not trivially equal because $f_X(x_k, x_l)$ and
$f_X(x_{k+n}, x_{l+n})$ are really the joint densities at two different pairs of times. This can be made clear using
the full notation: $f_X(x_k, x_l;\, k, l)$.




Example 8.1-14 
(WSS covariance function) The covariance function of Example 8.1-12 is shift-invariant
and so we can take advantage of the simplified notation. We can thus write
$K_{XX}[m] = \sigma^2\left(2\delta[m] + \delta[m-1] + \delta[m+1]\right)$.


Example 8.1-15 
(two-state random sequence with memory) We construct the two-level (binary) random
sequence $X[n]$ on $n \geq 0$ as follows. Recursively, and for each $n$ (> 0) in succession, and for
each level, we set $X[n] = X[n-1]$ with probability $p$, for some given $0 < p < 1$. Otherwise,
and with probability $q \triangleq 1 - p$, we set $X[n]$ to the “other” value (level). Let the two levels
be denoted $a$ and $b$, and start off the sequence with $X[0] = a$. When $p = 0.5$, this is a
special case of the Bernoulli random sequence. When $p \neq 0.5$, this is not an independent
random sequence, since $P_X(x_n | x_{n-1};\, n, n-1) \neq P_X(x_n;\, n)$. We say the random sequence
has memory. To see this, consider the case where $p = 1.0$; then set $x_n$ to the level other than
$x_{n-1}$, and note that the conditional transition probability $P_X(x_n | x_{n-1};\, n, n-1) = 0$, while
the unconditional probability $P_X(x_n;\, n)$ is not so constrained. In fact, $P_X(x_n;\, n)$ would not
be expected to favor either level, since the above transition rules are the same for either
level. Intuitively, at least, it makes sense to call $X[n-1]$ the state at time $n - 1$. In fact, the
rules for generating this random sequence can be summarized in the state-transition diagram
shown in Figure 8.1-14, where the directed branches are labeled by the relevant probabilities
for the next state, given the present state, as can easily be verified by inspection. We can
refer to $p$ as the no-transition probability. This is a first example of a Markov random
sequence, which will be studied in Section 8.5.

Figure 8.1-14 State-transition diagram of two-state (binary) random sequence with memory. (Branch label recovered from the figure: q = 1 - p.)

The following MATLAB m-file can generate sample functions for these random sequences
on n ≥ 1:


    function [w]=randmemseq(p,N,w0,a,b)
    % Two-level sequence: keep the previous level with probability p.
    w=a*ones(1,N);
    w(1)=w0;
    for i=2:N
        rnum=rand;
        if rnum<p
            w(i)=w(i-1);      % no transition
        else
            if w(i-1)==a      % transition to the other level
                w(i)=b;
            else
                w(i)=a;
            end
        end
    end
    stem(1:N,w)
    title('random sequence with memory')
    xlabel('discrete time')
    ylabel('level')
    end


Sample waveforms are given in Figures 8.1-15 to 8.1-17 corresponding to level values 
b = 1, a= 0, and several values of p. We note that when p is near 1, there are few transitions. 
For p near 0.5, there will be many transitions displaying little memory. When p = 0, there 
is a transition every time. 
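
The same generation rule can be sketched in Python for readers without MATLAB. This is a minimal illustration of the recursion described in Example 8.1-15, not a translation endorsed by the text: the function name `rand_mem_seq`, the `seed` argument, and the omission of plotting are choices made here.

```python
import random

def rand_mem_seq(p, n, x0, a, b, seed=None):
    """Generate n samples of the two-level sequence with no-transition probability p.

    Each sample repeats the previous level with probability p and
    switches to the other level with probability 1 - p.
    """
    rng = random.Random(seed)
    x = [x0]
    for _ in range(1, n):
        if rng.random() < p:
            x.append(x[-1])                    # no transition
        else:
            x.append(b if x[-1] == a else a)   # switch levels
    return x

print(rand_mem_seq(0.8, 20, 1, 0, 1, seed=1))  # long runs of the same level
print(rand_mem_seq(0.0, 6, 1, 0, 1))           # a transition at every step
```

Setting `p = 1.0` freezes the sequence at its initial level, and `p = 0.0` makes it alternate deterministically, which matches the limiting behavior noted above.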


Example 8.1-16
(correlation function of random sequence with memory) Assume that the random sequence
with memory of the last example has been running for a very long time. Later on we will
show that in this case, a steady state develops wherein the probabilities of the two levels
are constant with time and independent of the starting state (level). Here we assume that
the steady state holds for all finite time. Clearly from the symmetry shown in the state
diagram, it must be that $P_X(a) = P_X(b) = 0.5$. Now assume that the lower level $a = 0$
and the upper level is $b$ as before, and consider the correlation at two distinct times $n$ and
$n + k$. We can write

$$\begin{aligned}
R_{XX}[n, n+k] &= b^2 P_X(b, b;\, n, n+k) \\
&= b^2 P(X[n] = b)\, P(X[n+k] = b \mid X[n] = b) \\
&= (b^2/2)\, P(X[n+k] = b \mid X[n] = b),
\end{aligned}$$
Figure 8.1-15 Initial level X[1] = 1, no-transition probability p = 0.8. (Vertical axis: level; horizontal axis: time index n, from 0 to 50.)

Figure 8.1-16 Initial level X[1] = 1, no-transition probability p = 0.5, the Bernoulli case. (Vertical axis: level; horizontal axis: time index n, from 0 to 50.)

Figure 8.1-17 Initial value X[1] = 1, no-transition probability p = 0. (Vertical axis: level; horizontal axis: time index n, from 0 to 50.)

where the first equality holds since all the terms involving $a$ are zero when $a = 0$. Now the
only way that $X$ can equal $b$ at both times $n$ and $n + k$ is for an even number of transitions
to occur between these two times, and the probability of this is given by

$$P[\text{even number of transitions}] = \sum_{l = 0, 2, 4, \ldots}^{k} \binom{k}{l} (1 - p)^l p^{k-l} \triangleq A_e,$$

which follows from the fact that this is just Bernoulli trials with “success” = “transition”
and “failure” = “no transition.” Thus interchanging the usual role of $p$ and $q$ in Bernoulli
trials, we just add up the probability of an even number of successes (transitions). It turns
out that $A_e$ can be evaluated in closed form by the following “trick.” Define

$$A_o \triangleq \sum_{l = 1, 3, 5, \ldots}^{k} \binom{k}{l} (p - 1)^l p^{k-l}.$$

Clearly, we have $A_e - A_o = 1$ since $l$ is always odd valued in the sum $A_o$, so that
$(p - 1)^l = -(1 - p)^l$ there. Similarly we note that

$$A_e = \sum_{l = 0, 2, 4, \ldots}^{k} \binom{k}{l} (p - 1)^l p^{k-l},$$

where the equality holds because $l$ is always even in $A_e$. We now can see that

$$\begin{aligned}
A_e + A_o &= \sum_{l=0}^{k} \binom{k}{l} (p - 1)^l p^{k-l} \\
&= (2p - 1)^k,
\end{aligned}$$

by the Binomial Theorem. It follows at once that $A_e = (1/2)\left[(2p - 1)^k + 1\right]$, so that

$$R_{XX}[n, n+k] = (b^2/4)\left[(2p - 1)^k + 1\right],$$

which shows that $X[n]$ is WSS. We can write this correlation function more cleanly for the
case $p > 1/2$. On defining $\alpha \triangleq |\ln(2p - 1)|$, we have

$$R_{XX}[k] = (b^2/4)\left[\exp(-\alpha|k|) + 1\right].$$

Also since the mean value of $X[n]$ is easily seen to be $b/2$, we get the autocovariance function

$$K_{XX}[k] = (b^2/4)\exp(-\alpha|k|).$$
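
The closed form for the probability of an even number of transitions is easy to confirm numerically. The Python sketch below compares the direct binomial sum with $\tfrac{1}{2}[(2p-1)^k + 1]$ for a few illustrative values of $k$ and $p$ chosen here. The function names are not from the text.

```python
import math

def prob_even_transitions(k, p):
    """Direct sum: probability of an even number of transitions in k steps."""
    return sum(math.comb(k, l) * (1 - p) ** l * p ** (k - l)
               for l in range(0, k + 1, 2))

def closed_form(k, p):
    """Closed form A_e = ((2p - 1)^k + 1) / 2."""
    return 0.5 * ((2 * p - 1) ** k + 1)

for k in (1, 2, 5):
    for p in (0.2, 0.5, 0.8):
        print(k, p, prob_even_transitions(k, p), closed_form(k, p))
```

For $p < 1/2$ the term $(2p - 1)^k$ alternates in sign with $k$, which is why the exponential form of the correlation function is stated only for $p > 1/2$.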
A MATLAB m-file for displaying the covariance functions of these sequences, for three values
of p, is shown below:


    function [mc1,mc2,mc3]=markov(b,p1,p2,p3,N)
    % Covariance (b^2/4)*(2p-1)^k at lags k = 0..N-1 for three values of p.
    mc1=0*ones(1,N);
    mc2=0*ones(1,N);
    mc3=0*ones(1,N);
    for i=1:N
        mc1(i)=0.25*(b^2)*((2*p1-1)^(i-1));
        mc2(i)=0.25*(b^2)*((2*p2-1)^(i-1));
        mc3(i)=0.25*(b^2)*((2*p3-1)^(i-1));
    end
    x=linspace(0,N-1,N);
    plot(x,mc1,x,mc2,x,mc3)
    title('covariance of Markov Sequences')
    xlabel('Lag interval')
    ylabel('covariance value')


The normalized covariances for p = 0.8, 0.5, and 0.2 and b = 2 are shown in Figure 8.1-18. 


Figure 8.1-18 The covariance functions for different values of the parameter p. (Points connected by
straight lines. Vertical axis: covariance value; horizontal axis: lag interval, from 0 to 14.)

This type of random sequence, which exhibits a one-step memory, is called a Markov
random sequence (there are variations on the spelling of Markov) in honor of the
mathematician A. A. Markov (1856-1922). In Section 8.5 we shall discuss this class of random sequences
in greater detail. In the meanwhile we note that the system discussed in Example 8.1-5,
that is, $X[n] = \alpha X[n-1] + W[n]$, also exhibited a one-step memory and hence could also
be regarded as a Markov sequence, when $W[n]$ is an independent random sequence.








In Section 8.2, we provide a review or summary of the theory of linear systems for 
sequences, that is, discrete-time linear system theory. Readers with adequate background 
may skip this section. In Section 8.3, we will apply this theory to study the effect of 
linear systems on random sequences, an area rich in applications in communications, signal 
processing, and control systems. 


8.2 BASIC PRINCIPLES OF DISCRETE-TIME LINEAR SYSTEMS 


In this section we present some fundamental material on discrete-time linear system theory. 
This will then be extended in the next section to the case of random sequence inputs 
and outputs. This material is very similar to the continuous-time linear system theory 
including the topics of differential equations, Fourier transforms, and Laplace transforms. 
The corresponding quantities in the discrete-time theory are difference equations, Fourier 
transforms (for discrete-time signals), and Z-transforms. 

With reference to Figure 8.2-1 we see that a linear system can be thought of as having
an infinite-length sequence x[n] as input with a corresponding infinite-length sequence y[n]
as output. Representing this linear operation in equation form we have

$$y[n] = L\{x[n]\}, \tag{8.2-1}$$

where the linear operator L is defined to satisfy the following definition adapted to the
case of discrete-time signals. This notation might appear to indicate that x[n] at time n
is the only input value that affects the output y[n] at time n. In fact, all input values
can potentially affect the output at any time n. This is why we call L an operator† and
not merely a function. The examples below will make this point clear. Mathematicians
use the operator notation y = L{x}, which avoids this difficulty but makes the
functional dependence of x and y on the (time) parameter n less clear than in our engineering
notation.

[Block diagram: x[n] enters the block L{·}, which produces y[n].]

Figure 8.2-1 System diagram for generic linear system L{·} with input x[n] and output y[n] and time
index parameter n.

†Operators map functions (sequences) into functions (sequences).


Definition 8.2-1 We say a system with operator L is linear if for all permissible input
sequences $x_1[n]$ and $x_2[n]$, and for all permissible pairs of scalar gains $a_1$ and $a_2$, we have

$$L\{a_1 x_1[n] + a_2 x_2[n]\} = a_1 L\{x_1[n]\} + a_2 L\{x_2[n]\}. \qquad \blacksquare$$

In words, the response of a linear system to a weighted sum of inputs is the weighted sum
of the individual outputs. Examples of linear systems would include moving averages such as

$$y[n] = 0.33(x[n+1] + x[n] + x[n-1]), \quad -\infty < n < +\infty,$$

and autoregressions such as

$$y[n] = ay[n-1] + by[n-2] + cx[n], \quad 0 \le n < +\infty,$$

when the initial conditions are zero. Both these equations are special cases of the more
general linear constant-coefficient difference equation (LCCDE),

$$y[n] = \sum_{k=1}^{M} a_k y[n-k] + \sum_{k=0}^{N} b_k x[n-k]. \tag{8.2-2}$$


Example 8.2-1 
(solution of difference equations) Consider the following second-order LCCDE, 





$$y[n] = 1.7y[n-1] - 0.72y[n-2] + u[n], \tag{8.2-3}$$

with y[-1] = y[-2] = 0 and u[n] the unit-step function. To solve this equation for $n \ge 0$,
we first find the general solution to the homogeneous equation

$$y_h[n] = 1.7y_h[n-1] - 0.72y_h[n-2].$$

We try $y_h[n] = Ar^n$, where A and r are to be determined,† and obtain

$$A(r^n - 1.7r^{n-1} + 0.72r^{n-2}) = 0$$

†A thorough treatment of the solution of linear difference equations may be found in [8-5].

or

$$Ar^{n-2}(r^2 - 1.7r + 0.72) = 0.$$

We thus see that any value of r satisfying the characteristic equation

$$r^2 - 1.7r + 0.72 = 0$$


will give a general solution to the homogeneous equation. In this case there are two roots
at $r_1 = 0.8$ and $r_2 = 0.9$. By linear superposition the general homogeneous solution must
be of the form†

$$y_h[n] = A_1 r_1^n + A_2 r_2^n,$$

where the constants $A_1$ and $A_2$ may be determined from the initial conditions.

To obtain the particular solution, we first observe that the input sequence u[n] equals 1
for $n \ge 0$. Thus we try as a particular solution a constant, that is, following standard
practice,

$$y_p[n] = B \quad \text{for } n \ge 0$$

and obtain

$$B - 1.7B + 0.72B = 1$$

or

$$B = 1/(1 - 1.7 + 0.72) = 1/(0.02) = 50.$$


More generally this method can be modified for any input function of the form $C\rho^n$
over adjoining time intervals $[n_1, n_2 - 1]$. One just assumes the corresponding form for
the solution and determines the constant as shown. In this approach, we would solve the
difference equation for each time interval separately, piecing the solution together at the
boundaries by carrying across final conditions to become the initial conditions for the next
interval. We illustrate our approach here for the time interval starting at n = 0. The total
solution is

$$y[n] = y_h[n] + y_p[n] = A_1(0.8)^n + A_2(0.9)^n + 50 \quad \text{for } n \ge 0.$$


To determine $A_1$ and $A_2$, we first evaluate Equation 8.2-3 at n = 0 and n = 1 using
y[-1] = y[-2] = 0 to carry across the initial conditions to obtain y[0] = 1 and y[1] = 2.7,
from which we obtain the linear equations

$$A_1 + A_2 + 50 = 1 \quad (\text{at } n = 0)$$

and

$$A_1(0.8) + A_2(0.9) + 50 = 2.7 \quad (\text{at } n = 1).$$

This can be put in matrix form

$$\begin{bmatrix} 1.0 & 1.0 \\ 0.8 & 0.9 \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} = \begin{bmatrix} -49.0 \\ -47.3 \end{bmatrix}$$

and solved to yield

$$\begin{bmatrix} A_1 \\ A_2 \end{bmatrix} = \begin{bmatrix} 32 \\ -81 \end{bmatrix}.$$

Thus the complete solution, valid for $n \ge 0$, is

$$y[n] = 32(0.8)^n - 81(0.9)^n + 50.$$

We could then write the solution for all time, if the system was at rest for n < 0, as

$$y[n] = \{32(0.8)^n - 81(0.9)^n + 50\}u[n].$$

†Since the two roots are less than one in magnitude, the solution will be stable when run forward in
time index n (cf. [8-5]).
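The closed-form answer of Example 8.2-1 can be cross-checked by running the difference equation directly from initial rest. The sketch below is an illustrative Python check, not part of the original text; the function names are hypothetical.

```python
def solve_by_recursion(num_samples):
    """Run y[n] = 1.7 y[n-1] - 0.72 y[n-2] + u[n] forward from y[-1] = y[-2] = 0."""
    y = []
    for n in range(num_samples):
        y_minus_1 = y[n - 1] if n >= 1 else 0.0
        y_minus_2 = y[n - 2] if n >= 2 else 0.0
        y.append(1.7 * y_minus_1 - 0.72 * y_minus_2 + 1.0)  # u[n] = 1 for n >= 0
    return y


def closed_form_solution(n):
    """Homogeneous plus particular solution with A1 = 32, A2 = -81, B = 50."""
    return 32.0 * 0.8**n - 81.0 * 0.9**n + 50.0


recursive = solve_by_recursion(20)
closed = [closed_form_solution(n) for n in range(20)]
max_error = max(abs(r - c) for r, c in zip(recursive, closed))
print(max_error)  # should be at the level of floating-point rounding
```

Running the recursion is also how the values y[0] = 1 and y[1] = 2.7 used for the linear equations arise.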





Note that the LCCDE in the previous example is a linear system because the initial 
conditions, that is, y[—1], y[—2], were zero, often called the initial rest condition. Without 
initial rest, an LCCDE is not a linear system. More generally, linear systems are described 
by superposition with a possibly time-variant impulse response 


$$h[n,k] \triangleq L\{\delta[n-k]\}.$$

In words we call h[n, k] the response at time n to an impulse applied at time k. We derive
the result by simply writing the input as $x[n] = \sum_k x[k]\delta[n-k]$, and then using linearity to
conclude

$$\begin{aligned}
y[n] &= L\Big\{\sum_{k=-\infty}^{+\infty} x[k]\delta[n-k]\Big\} \\
&= \sum_{k=-\infty}^{+\infty} x[k]L\{\delta[n-k]\} \\
&= \sum_{k=-\infty}^{+\infty} x[k]h[n,k],
\end{aligned}$$

which is called the superposition summation representation for linear systems.

Many linear systems are made of constant components and have an effect on input
signals that is invariant to when the signal arrives at the system. A linear system is called
linear time-invariant (LTI) or, equivalently, linear shift-invariant (LSI) if the response to
a delayed (shifted) input is just the delayed (shifted) response. More precisely, we have the
following.

Definition 8.2-2 A linear system L is called shift-invariant if for all integer shifts k,
$-\infty < k < +\infty$, we have

$$y[n+k] = L\{x[n+k]\} \quad \text{for all } n. \qquad \blacksquare \tag{8.2-4}$$


An important property of LSI systems is that they are described by convolution,† that is,
L is a convolution operator,

$$y[n] = h[n] * x[n] = x[n] * h[n],$$

where

$$h[n] * x[n] \triangleq \sum_{k=-\infty}^{+\infty} h[k]x[n-k], \tag{8.2-5}$$

and the sequence

$$h[n] \triangleq L\{\delta[n]\}$$

is called the impulse response. With relation to the time-varying impulse response h[n, k],
we can see that h[n] = h[n, 0] when a linear system is shift-invariant.


In words we can say that—just as for continuous-time systems—if we know the impulse 
response of an LSI system, then we can compute the response to any other input by carrying 
out the convolution operation. In the discrete-time case this convolution operation is a 
summation rather than an integration, but the operation is otherwise the same. 
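As a concrete illustration of the convolution summation, here is a minimal Python sketch for finite-length causal sequences. The particular impulse response (a causal three-point average) and the input values are hypothetical choices made only for this example.

```python
def convolve(h, x):
    """Full linear convolution y[n] = sum_k h[k] x[n - k] of two finite causal sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for k, h_k in enumerate(h):
        for m, x_m in enumerate(x):
            y[k + m] += h_k * x_m  # the term h[k] x[m] contributes to output index n = k + m
    return y


h = [1 / 3, 1 / 3, 1 / 3]     # hypothetical impulse response: causal three-point average
x = [3.0, 6.0, 9.0, 12.0]     # hypothetical input samples x[0], ..., x[3]
y = convolve(h, x)
print(y)
```

Feeding a unit impulse, `convolve(h, [1.0])`, returns h itself, which is the statement that h[n] is the response to δ[n]; swapping the arguments gives the same output, reflecting commutativity of convolution.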

While in principle we could determine the output to any input, given knowledge of the 
impulse response, in practice the calculation of the convolution operation may be tedious 
and time consuming. To facilitate such calculations and also to gain added insight, we turn 
to a frequency-domain characterization of LSI systems. We begin by defining the Fourier 
transform (FT) for sequences as follows. 


Definition 8.2-3 The Fourier transform for a discrete-time signal or sequence is
defined by the infinite sum (if it exists)

$$X(\omega) = FT\{x[n]\} \triangleq \sum_{n=-\infty}^{+\infty} x[n]e^{-j\omega n}, \quad \text{for } -\pi \le \omega \le +\pi,$$

and the function $X(\omega)$ is periodic with period $2\pi$ outside this range. The inverse Fourier
transform is given as

$$x[n] = IFT\{X(\omega)\} = \frac{1}{2\pi}\int_{-\pi}^{+\pi} X(\omega)e^{+j\omega n}\,d\omega. \qquad \blacksquare$$

†We encountered the operation of convolution in Chapter 3 when we computed the pdf of the sum of
two independent RVs.

One can see that the Fourier transform and its inverse for sequences are really just
the familiar Fourier series with the sequence x playing the role of the Fourier coefficients
and the Fourier transform X playing the role of the periodic function. Thus, the existence
and uniqueness theorems of Fourier series are immediately applicable here to the Fourier
transform for discrete-time signals. Note that the frequency variable $\omega$ is sometimes called
normalized frequency because, if the sequence x[n] arose from sampling, the period of such
sampling has been lost. It is as though the sample period were T = 1, as would be consistent
with the $[-\pi, +\pi]$ frequency range of the Fourier transform $X(\omega)$.†

For an LSI system the Fourier transform is particularly significant owing to the fact
that complex exponentials are the eigenfunctions of discrete-time LSI systems, that is,

$$L\{e^{j\omega n}\} = H(\omega)e^{j\omega n}, \tag{8.2-6}$$

as long as the impulse response h is absolutely summable. For LSI systems this absolute
summability can easily be seen to be equivalent to bounded-input bounded-output (BIBO)
stability [8-5].

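Just as in continuous-time system theory, multiplication of Fourier transforms
corresponds to convolution in the time (or space) domain.

Theorem 8.2-1 (convolution theorem) The convolution,

$$y[n] = x[n] * h[n], \quad -\infty < n < +\infty,$$

is equivalent in the transform domain to

$$Y(\omega) = X(\omega)H(\omega), \quad -\pi \le \omega \le +\pi.$$

Proof

$$\begin{aligned}
Y(\omega) &= \sum_{n=-\infty}^{+\infty} y[n]e^{-j\omega n} = \sum_{n=-\infty}^{+\infty} (x[n] * h[n])e^{-j\omega n} \\
&= \sum_{n}\sum_{k} x[k]h[n-k]e^{-j\omega n} \\
&= \sum_{n}\sum_{k} x[k]e^{-j\omega k}h[n-k]e^{-j\omega(n-k)} \\
&= \sum_{k} x[k]e^{-j\omega k}\Big(\sum_{n} h[n-k]e^{-j\omega(n-k)}\Big) \\
&= \sum_{k} x[k]e^{-j\omega k}H(\omega) \\
&= X(\omega)H(\omega). \qquad \blacksquare
\end{aligned}$$

†If the sequence arose from sampling with sample period T, the (true) radian frequency is $\Omega = \omega/T$.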
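The convolution theorem can be spot-checked numerically for finite-length sequences, for which the Fourier sums are finite. The Python sketch below is illustrative only; the sequences and test frequencies are arbitrary assumptions.

```python
import cmath
import math


def dtft(x, omega):
    """X(omega) = sum_n x[n] exp(-j omega n) for a finite causal sequence x[0], x[1], ..."""
    return sum(x_n * cmath.exp(-1j * omega * n) for n, x_n in enumerate(x))


def convolve(h, x):
    """Full linear convolution of two finite causal sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for k, h_k in enumerate(h):
        for m, x_m in enumerate(x):
            y[k + m] += h_k * x_m
    return y


x = [1.0, 2.0, 0.5, -1.0]   # arbitrary input sequence
h = [1.0, -1.0]             # first-order backward difference
y = convolve(x, h)

frequencies = [-math.pi, -1.0, 0.0, 0.3, 2.0]
errors = [abs(dtft(y, w) - dtft(x, w) * dtft(h, w)) for w in frequencies]
print(max(errors))  # should be at rounding level
```

Because h sums to zero here, H(0) = 0, so the output transform vanishes at ω = 0 regardless of the input.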
Thus, discrete-time linear shift-invariant systems are easily understood in the frequency
domain, similar to the situation for continuous-time LSI systems. Analogous to the Laplace
transform for continuous-time signals, there is the Z-transform for discrete-time signals. It
is defined as follows.

Definition 8.2-4 The Z-transform of a discrete-time signal or sequence is defined as
the infinite summation (if it exists)

$$\mathsf{X}(z) \triangleq \sum_{n=-\infty}^{+\infty} x[n]z^{-n}, \tag{8.2-7}$$

where z is a complex variable in the region of absolute convergence of this infinite sum.† ■

Note that $\mathsf{X}(z)$ is a function of a complex variable, while $X(\omega)$ is a function of a real
variable. The two are related by $\mathsf{X}(z)|_{z=e^{j\omega}} = X(\omega)$. We thus see that, if the Z-transform
exists, the Fourier transform is just the restriction of the Z-transform to the unit circle in
the complex z-plane. Similarly to the proof of Theorem 8.2-1, it is easy to show that the
convolution-multiplication property of Theorem 8.2-1 is also true for Z-transforms. Analogous
to continuous-time theory, the Z-transform $\mathsf{H}(z)$ of the impulse response h[n] of an LSI
system is called the system function. For more information on discrete-time signals and
systems, the reader is referred to [8-5].


8.3 RANDOM SEQUENCES AND LINEAR SYSTEMS 


In this section we look at the topic of linear systems with random sequence inputs. In 
particular we will look at how the mean and covariance functions are transformed by both 
linear and LSI systems. We will do this first for the general case of a nonstationary random 
sequence and then specialize to the more common case of a stationary sequence. The topics of 
this section are perhaps the most widely used concepts from the theory of random sequences. 
Applications arise in communications when analyzing signals and noise in linear filters, in 
digital signal processing for the analysis of quantization noise in digital filters, and in control 
theory to find the effect of disturbance inputs on an otherwise deterministic control system. 

The first issue is the meaning of inputting a random sequence to a linear system. The
problem is that a random sequence is not just one sequence but a whole family of sequences
indexed by the parameter $\zeta$, a point (outcome) in the sample space. As such for each fixed
$\zeta$, the random sequence is just an ordinary sequence that may be a permissible input for
the linear system. Thus, when we talk about a linear system with a random sequence input,
it is natural to say that for each point in the sample space $\Omega$, we input the corresponding
realization, that is, the sample sequence x[n]. We would therefore regard the corresponding
output y[n] as a sample sequence‡ corresponding to the same point $\zeta$ in the sample space,
thus collectively defining the output random sequence Y[n].

†Note the sans serif font to distinguish between the Z-transform and the Fourier transform.
‡Recall that x[n], y[n] denote $X[n, \zeta]$, $Y[n, \zeta]$, respectively, for fixed $\zeta$.

Definition 8.3-1 When we write Y[n] = L{X[n]} for a random sequence X[n] and
a linear system L, we mean that for each $\zeta \in \Omega$ we have

$$Y[n, \zeta] = L\{X[n, \zeta]\}.$$

Equivalently, for each sample function x[n] taken on by the input random sequence X[n],
we set y[n] as the corresponding sample sequence of the output random sequence Y[n], that
is, y[n] = L{x[n]}. ■


This is the simplest way to treat systems with random inputs. A difficulty arises when 
the input sample sequences do not “behave well,” in which case it may not be possible to 
define the system operation for every one of them. In Chapter 10 we will generalize this 
definition and discuss a so-called mean-square description of the system operation, which 
avoids such problems, although of necessity it will be more abstract. 

In most cases it is very hard to find the probability distribution of the output from 
the probabilistic description of the input to a linear system. The reason is that since the 
impulse response is often very long (or infinitely long), high-order distributions of the input 
sequence would be required to determine the output CDF. In other words, if Y [n] depends 
on the most recent k input values X[n],...,X[n — k + 1], then the kth-order pdf of X 
has to be known in order to compute even the first-order pdf of Y. The situation with 
moment functions is different. The moments of the output random sequence can be calcu- 
lated from equal- or lower-order moments of the input, when the system is linear. Partly for 
this reason, it is of considerable interest to determine the output moment functions in terms 
of the input moment functions. In the practical and important case of the Gaussian random 
sequence, we have seen that the entire probabilistic description depends only on the mean 
and covariance functions. In fact because the linear system is in effect performing a linear 
transformation on the infinite-dimensional vector that constitutes the input sequence, we 
can see that the output sequence will also obey the Gaussian law in its nth-order distribu- 
tions if the input sequence is Gaussian. Thus, the determination of the first- and second- 
order moment functions of the output is particularly important when the input sequence is 
Gaussian. 


Theorem 8.3-1 For a linear system L and a random sequence X[n], the mean of the
output random sequence Y[n] is

$$E\{Y[n]\} = L\{E\{X[n]\}\} \tag{8.3-1}$$

as long as both sides are well defined.

Proof (formal). Since L is a linear operator, we can write

$$y[n] = \sum_{k=-\infty}^{+\infty} h[n,k]x[k]$$

for each sample sequence input-output pair, or

$$Y[n, \zeta] = \sum_{k=-\infty}^{+\infty} h[n,k]X[k, \zeta],$$

where we explicitly indicate the outcome $\zeta$. If we operate on both sides with the expectation
operator E, we get

$$E\{Y[n]\} = E\Big\{\sum_{k=-\infty}^{+\infty} h[n,k]X[k]\Big\}.$$

Now, assuming it is valid to bring the operator E inside the infinite sum, we get

$$E\{Y[n]\} = \sum_{k=-\infty}^{+\infty} h[n,k]E\{X[k]\} = L\{E\{X[n]\}\},$$

which can be written as

$$\mu_Y[n] = \sum_{k=-\infty}^{+\infty} h[n,k]\mu_X[k],$$

that is, the mean function of the output is the response of the linear system to the mean
function of the input. ■


Some comments are necessary with regard to this interchange of the expectation and
linear operator. It cannot always be done! For example, if the input has a nonzero mean
function and the linear system is a running sum, that is,

$$y[n] = \sum_{k=0}^{+\infty} x[n-k],$$

the running sum of the mean may not converge. Then such an interchange is not valid. We
will come back to this point when we study stochastic convergence in Section 8.7. We will
see then that a sufficient condition for an LSI system to satisfy Equation 8.3-1 is that its
impulse response h[n] be absolutely summable.

There are special cases of Equation 8.3-1 depending on whether the input sequence is
WSS and whether the system is LSI. If the system is LSI and the input is at least WSS,
then the mean of the output is given as

$$E\{Y[n]\} = \sum_{k=-\infty}^{+\infty} h[n-k]\mu_X.$$

Now because $\mu_X$ is a constant, we can take it out of the sum and obtain

$$E\{Y[n]\} = \Big[\sum_{k=-\infty}^{+\infty} h[k]\Big]\mu_X \tag{8.3-2}$$

$$= \mathsf{H}(z)|_{z=1}\,\mu_X, \tag{8.3-3}$$

at least whenever $\sum_{k=-\infty}^{+\infty} |h[k]|$ exists, that is, for any BIBO stable system (cf. Section 8.2).

Thus, we observe that in this case the mean of the output random sequence is a constant
equal to the product of the dc gain or constant gain of the LSI system times the mean of
the input sequence.


Example 8.3-1 
(lowpass filter) Let the system be a lowpass filter with system function 





$$\mathsf{H}(z) = 1/(1 + az^{-1}),$$

where we require |a| < 1 for stability of this assumed causal filter (i.e., the region of
convergence is |z| > |a|, which includes the unit circle). Then if a WSS sequence is the
input to this filter, the mean of the output will be

$$E\{Y[n]\} = \mathsf{H}(z)|_{z=1}E\{X[n]\} = (1 + a)^{-1}\mu_X.$$
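The dc-gain result can be illustrated numerically. The Python sketch below assumes the causal impulse response h[n] = (-a)^n u[n] that corresponds to this system function, truncated to a finite number of taps; the values of a and of the input mean are hypothetical.

```python
def impulse_response(a, num_taps):
    """First num_taps samples of h[n] = (-a)^n u[n], the causal inverse of 1/(1 + a z^-1)."""
    return [(-a) ** n for n in range(num_taps)]


def filter_constant_input(a, level, num_samples):
    """Run y[n] = -a y[n-1] + x[n] from initial rest with x[n] = level for n >= 0."""
    y_prev = 0.0
    outputs = []
    for _ in range(num_samples):
        y_prev = -a * y_prev + level
        outputs.append(y_prev)
    return outputs


a = 0.5       # hypothetical filter coefficient, |a| < 1
mu_x = 2.0    # hypothetical constant input mean
dc_gain = sum(impulse_response(a, 200))   # approximates H(z) at z = 1
mu_y = dc_gain * mu_x
steady_state = filter_constant_input(a, mu_x, 200)[-1]
print(dc_gain, mu_y, steady_state)
```

The second function feeds the constant mean itself through the difference equation; its steady-state value agrees with the dc gain times the input mean, as Equation 8.3-3 states.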


We now turn to the problem of calculating the output covariance and correlation of the 
general linear system whose operator is L: 


Y [n] = L{X{n]}. 


We will find it convenient to introduce a cross-correlation function between the input 
and output, 


$$R_{XY}[m,n] \triangleq E\{X[m]Y^*[n]\} \tag{8.3-4}$$

$$= E\{X[m](L\{X[n]\})^*\}. \tag{8.3-5}$$

Now, in order to factor out the operator, we introduce the operator $L_n^*$, with impulse
response $h^*[n,k]$, which operates on time index n, but treats time index m as a constant.
We can then write $R_{XY}[m,n] = E\{X[m]L_n^*\{X^*[n]\}\} = L_n^*\{E\{X[m]X^*[n]\}\}$. Similarly we
denote with $L_m$ the linear operator with time index m, that treats n as a constant. The
operator $L_n^*$ is related to the adjoint operator studied in linear algebra.


Theorem 8.3-2 Let X[n] and Y[n] be two random sequences that are the input
and output, respectively, of the linear operator $L_n$. Let the input correlation function be
$R_{XX}[m,n]$. Then the cross- and output-correlation functions are, respectively, given by

$$R_{XY}[m,n] = L_n^*\{R_{XX}[m,n]\}$$

and

$$R_{YY}[m,n] = L_m\{R_{XY}[m,n]\}.$$

Proof Write

$$X[m]Y^*[n] = X[m]L_n^*\{X^*[n]\} = L_n^*\{X[m]X^*[n]\}.$$

Then

$$\begin{aligned}
R_{XY}[m,n] = E\{X[m]Y^*[n]\} &= E\{L_n^*\{X[m]X^*[n]\}\} \\
&= L_n^*\{E\{X[m]X^*[n]\}\} \\
&= L_n^*\{R_{XX}[m,n]\},
\end{aligned}$$

thus establishing the first part of the theorem. To show the second part, we proceed
analogously by multiplying Y[m] by $Y^*[n]$ to get

$$\begin{aligned}
E\{Y[m]Y^*[n]\} &= E\{L_m\{X[m]Y^*[n]\}\} \\
&= L_m\{E\{X[m]Y^*[n]\}\} \\
&= L_m\{R_{XY}[m,n]\},
\end{aligned}$$

as was to be shown. ■
If we combine both parts of Theorem 8.3-2 we get an operator expression for the output
correlation in terms of the input correlation function:

$$R_{YY}[m,n] = L_m\{L_n^*\{R_{XX}[m,n]\}\}, \tag{8.3-6}$$

which can be put into the form of a superposition summation for a system with time-variant
impulse response h[n, k] as

$$R_{YY}[m,n] = \sum_{k=-\infty}^{+\infty} h[m,k]\Big(\sum_{l=-\infty}^{+\infty} h^*[n,l]R_{XX}[k,l]\Big). \tag{8.3-7}$$

Here the superposition summation representation for $R_{XY}[m,n]$ is

$$R_{XY}[m,n] = L_n^*\{R_{XX}[m,n]\} = \sum_{l=-\infty}^{+\infty} h^*[n,l]R_{XX}[m,l],$$

and that for $R_{YX}[m,n]$ is

$$R_{YX}[m,n] = \sum_{k=-\infty}^{+\infty} h[m,k]R_{XX}[k,n].$$

To find the corresponding results for covariance functions, we note that the centered output
sequence is the output due to the centered input sequence, due to the linearity of the system
and Equation 8.3-1. Then applying Theorem 8.3-2 to these zero-mean sequences, we have
immediately that, for covariance functions,

$$K_{XY}[m,n] = L_n^*\{K_{XX}[m,n]\}, \tag{8.3-8}$$

$$K_{YY}[m,n] = L_m\{K_{XY}[m,n]\}, \tag{8.3-9}$$

and

$$K_{YY}[m,n] = L_m\{L_n^*\{K_{XX}[m,n]\}\}, \tag{8.3-10}$$

which becomes the following superposition summation:

$$K_{YY}[m,n] = \sum_{k=-\infty}^{+\infty} h[m,k]\Big(\sum_{l=-\infty}^{+\infty} h^*[n,l]K_{XX}[k,l]\Big). \tag{8.3-11}$$

Example 8.3-2 
(edge detector) Let $Y[n] \triangleq X[n] - X[n-1] = L\{X[n]\}$, an operator that represents a
first-order (backward) difference. See Figure 8.3-1. This linear operator could be applied
to locate an impulse noise spike in some random data. The output mean is $E[Y[n]] =
L\{E[X[n]]\} = \mu_X[n] - \mu_X[n-1]$. The cross-correlation function is

$$R_{XY}[m,n] = L_n^*\{R_{XX}[m,n]\} = R_{XX}[m,n] - R_{XX}[m,n-1].$$

The output autocorrelation function is

$$\begin{aligned}
R_{YY}[m,n] &= L_m\{R_{XY}[m,n]\} \\
&= R_{XY}[m,n] - R_{XY}[m-1,n] \\
&= R_{XX}[m,n] - R_{XX}[m-1,n] - R_{XX}[m,n-1] + R_{XX}[m-1,n-1].
\end{aligned}$$

If the input random sequence were WSS with autocorrelation function

$$R_{XX}[m,n] = a^{|m-n|}, \quad 0 < a < 1,$$


then the above example would specialize to

$$\mu_Y[n] = 0,$$

$$R_{XY}[m,n] = a^{|m-n|} - a^{|m-n+1|},$$

and

$$R_{YY}[m,n] = 2a^{|m-n|} - a^{|m-1-n|} - a^{|m-n+1|},$$

which depends on only m - n. Hence the output random sequence is WSS and we can write
(with k = m - n)

$$R_{YY}[k] = 2a^{|k|} - a^{|k-1|} - a^{|k+1|}.$$

For the input autocorrelation with a = 0.7 as shown in Figure 8.3-2, the output
autocorrelation function is shown in Figure 8.3-3. Note that the edge detector has a strong tendency
to decorrelate the input sequence.

[Block diagram: X[n] and its delayed copy from a delay unit are differenced to form Y[n].]

Figure 8.3-1 An edge detector that gives nearly zero output when X[n] = X[n-1] and a large output
when |X[n] - X[n-1]| is large.

Figure 8.3-2 Input correlation function for edge detector with a = 0.7.
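The two-step operator calculation of Example 8.3-2 is easy to verify numerically. The following Python sketch applies the backward difference once in each index to the input autocorrelation and compares the result with the one-parameter closed form; the function names are illustrative.

```python
def r_xx(m, n, a):
    """Input autocorrelation R_XX[m, n] = a^|m - n|."""
    return a ** abs(m - n)


def r_yy_from_differences(m, n, a):
    """Apply the backward difference in n, then in m, to R_XX[m, n]."""
    return (r_xx(m, n, a) - r_xx(m - 1, n, a)
            - r_xx(m, n - 1, a) + r_xx(m - 1, n - 1, a))


def r_yy_closed_form(k, a):
    """One-parameter form R_YY[k] = 2 a^|k| - a^|k-1| - a^|k+1| with k = m - n."""
    return 2 * a ** abs(k) - a ** abs(k - 1) - a ** abs(k + 1)


a = 0.7
for k in range(-3, 4):
    print(k, r_yy_closed_form(k, a))
```

With a = 0.7 the output correlation at lag 0 is 2 - 2a = 0.6, and at lags of magnitude 1 it is 2a - 1 - a^2 = -0.09, consistent with the decorrelating behavior noted above.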


Example 8.3-3 
(covariance functions of a recursive system) With |a| < 1, let 





$$Y[n] = aY[n-1] + (1-a)W[n] \tag{8.3-12}$$

for $n \ge 0$ subject to Y[-1] = 0. Since the initial condition is zero, the system is equivalently
LSI for $n \ge 0$, so we can represent L by convolution, where

$$h[n] = (1-a)a^n u[n].$$

Figure 8.3-3 Correlation function $R_{YY}[k]$ for backward difference example (plot has a = 0.7).

Here h[n] is the impulse response of the corresponding deterministic first-order difference
equation, that is, h[n] is the solution to

$$h[n] = ah[n-1] + (1-a)\delta[n],$$

where $\delta[n]$ is the discrete-time impulse sequence. This solution can be obtained easily by
recursion or by using the Z-transform.† Then specializing Equation 8.3-1, we obtain

$$\mu_Y[n] = \sum_{k=0}^{\infty} (1-a)a^k \mu_W[n-k], \quad \text{where } \mu_W[n] = 0 \text{ for } n < 0.$$

Applying Equations 8.3-8 and 8.3-9 to this case enables us to write, for a real,

$$K_{WY}[m,n] = \sum_{k=0}^{\infty} (1-a)a^k K_{WW}[m, n-k]$$

and

$$K_{YY}[m,n] = \sum_{l=0}^{\infty} (1-a)a^l K_{WY}[m-l, n],$$

which can be combined to yield

$$K_{YY}[m,n] = \sum_{k=0}^{\infty}\sum_{l=0}^{\infty} (1-a)^2 a^k a^l K_{WW}[m-l, n-k].$$

Now if the input sequence W[n] has covariance function

$$K_{WW}[m,n] = \sigma_W^2\delta[m-n] \quad \text{for } m, n \ge 0,$$

then the output covariance is calculated as

$$\begin{aligned}
K_{YY}[m,n] &= \sum_{k=0}^{n} (1-a)^2 a^k a^{(m-n)+k}\sigma_W^2 && \text{for } m \ge n \ge 0, \\
&= a^{(m-n)}(1-a)^2 \sum_{k=0}^{n} a^{2k}\sigma_W^2 \\
&= a^{(m-n)}\big[(1-a)^2/(1-a^2)\big]\sigma_W^2\,(1 - a^{2n+2}) && \text{for } m \ge n \ge 0 \\
&= \big[(1-a)/(1+a)\big]a^{|m-n|}\sigma_W^2\,(1 - a^{2\min(m,n)+2}) && \text{for all } m, n \ge 0,
\end{aligned}$$

where the last step follows from the required symmetry in (m, n). Note that the term
$a^{2\min(m,n)+2}$ is a transient that dies away as $m, n \to \infty$, since |a| < 1, so that asymptotically
we have the steady-state answer

$$K_{YY}[m,n] = \Big(\frac{1-a}{1+a}\Big)\sigma_W^2 a^{|m-n|}, \quad m, n \to \infty,$$

a shift-invariant covariance function. If the mean function $\mu_Y[n]$ is found to be asymptotic
to a constant, then the random sequence Y[n] is said to be asymptotically WSS. We discuss
WSS random sequences in greater detail in the next section.

†Taking the Z-transform of both sides of the above equation for h[n], and noting that the Z-transform of the
impulse sequence is 1, we obtain $\mathsf{H}(z) = (1-a)/(1-az^{-1})$. Upon applying the inverse Z-transform, one
gets the h[n] given above. (For help with the inverse Z-transform, see Appendix A.)
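The closed-form output covariance of Example 8.3-3 can be compared against a direct evaluation of the double superposition sum. The Python sketch below assumes a real coefficient a and a white input with variance `var_w` (a hypothetical parameter name), with both sums truncated where the input covariance is defined.

```python
def k_yy_by_summation(m, n, a, var_w):
    """Double superposition sum with K_WW[m, n] = var_w * delta[m - n] for m, n >= 0."""
    total = 0.0
    for k in range(n + 1):          # n - k >= 0
        for l in range(m + 1):      # m - l >= 0
            if m - l == n - k:      # the delta function selects matching time indices
                total += (1 - a) ** 2 * a**k * a**l * var_w
    return total


def k_yy_closed_form(m, n, a, var_w):
    """[(1-a)/(1+a)] a^|m-n| var_w (1 - a^(2 min(m,n) + 2))."""
    transient = 1 - a ** (2 * min(m, n) + 2)
    return ((1 - a) / (1 + a)) * a ** abs(m - n) * var_w * transient


a, var_w = 0.5, 2.0   # hypothetical values with |a| < 1
for m, n in [(0, 0), (3, 1), (1, 3), (10, 10)]:
    print(m, n, k_yy_by_summation(m, n, a, var_w), k_yy_closed_form(m, n, a, var_w))
```

For large m and n the transient factor approaches one, and the diagonal values settle at var_w(1 - a)/(1 + a), the steady-state variance.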








As an alternative to this method of solution, one can take the expectation of 
Equation 8.3-12 to directly obtain a recursive equation for the output mean sequence which 
can be solved by the methods of Section 8.2: 


$$\mu_Y[n] = a\mu_Y[n-1] + (1-a)\mu_W[n], \quad n \ge 0,$$

with an appropriate initial condition. For example, if $\mu_Y[-1] = 0$ and $\mu_W[n] = \mu_W$, a given
constant, then the solution is

$$\mu_Y[n] = (1 - a^{n+1})\mu_W u[n].$$
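A short numerical check of this mean recursion, written as an illustrative Python sketch with hypothetical parameter values, is as follows.

```python
def mean_by_recursion(a, mu_w, num_samples):
    """Iterate mu_Y[n] = a mu_Y[n-1] + (1 - a) mu_W from mu_Y[-1] = 0."""
    mu_y = []
    previous = 0.0
    for _ in range(num_samples):
        current = a * previous + (1 - a) * mu_w
        mu_y.append(current)
        previous = current
    return mu_y


def mean_closed_form(a, mu_w, n):
    """Closed-form solution (1 - a^(n+1)) mu_W for n >= 0."""
    return (1 - a ** (n + 1)) * mu_w


a, mu_w = 0.9, 3.0   # hypothetical values with |a| < 1
recursive_mean = mean_by_recursion(a, mu_w, 30)
print(recursive_mean[0], recursive_mean[-1])
```

The output mean starts at (1 - a) times the input mean and approaches the input mean itself, since the dc gain of this filter is one.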


We can also use this method to calculate the cross-correlation function between input and 
output. First we conjugate Equation 8.3-12, then multiply by W[m], and finally take the 
expectation to yield, for a real, 


$$R_{WY}[m,n] = aR_{WY}[m,n-1] + (1-a)R_{WW}[m,n], \tag{8.3-13}$$

which can be solved directly for $R_{WY}$ in terms of $R_{WW}$. The partial difference equation for
the output correlation $R_{YY}$ is obtained by re-expressing Equation 8.3-12 as a function of
m, multiplying by $Y^*[n]$, and then taking the expectation to yield

$$R_{YY}[m,n] = aR_{YY}[m-1,n] + (1-a)R_{WY}[m,n]. \tag{8.3-14}$$


These two difference equations can be solved by the methods of Section 8.2 since they can 
each be seen to be one-dimensional difference equations with constant coefficients in one 
index, with the other index simply playing the role of an additional parameter. Thus, for 
example, one must solve Equation 8.3-13 as a function of n for each value of m in succession. 


8.4 WSS RANDOM SEQUENCES 


In this section we will assume that the random sequences of interest are all WSS, that is,
(1) $E\{X[n]\} = \mu_X$, a constant,
(2) $R_{XX}[k+m, k] = E\{X[k+m]X^*[k]\} = R_{XX}[m]$,

and of second order, that is, $E\{|X[n]|^2\} < \infty$.
Some important properties of the autocorrelation function of stationary random
sequences are presented below. They also hold for covariance functions, since they are just
the autocorrelation function of the centered random sequence $X_c[n] \triangleq X[n] - \mu_X$.

1. For arbitrary m, $|R_{XX}[m]| \le R_{XX}[0] \ge 0$, which follows directly from
$E\{|X[m] - X[0]|^2\} \ge 0$ for X[n] real valued, otherwise use Schwarz inequality (cf.
Equation 4.3-15).

2. $|R_{XY}[m]| \le \sqrt{R_{XX}[0]R_{YY}[0]}$, which is derived using the Schwarz inequality.

3. $R_{XX}[m] = R_{XX}^*[-m]$ since $R_{XX}[m] = E\{X[m+l]X^*[l]\} = E\{X[l]X^*[l-m]\} =
E^*\{X[l-m]X^*[l]\} = R_{XX}^*[-m]$.

4. For all N > 0 and all complex $a_1, a_2, \ldots, a_N$, we must have

$$\sum_{n=1}^{N}\sum_{k=1}^{N} a_n a_k^* R_{XX}[n-k] \ge 0.$$


Property 4 is the positive semidefinite property of autocorrelation functions. It is a 
necessary and sufficient property for a function to be a valid autocorrelation function of a 
random sequence. In general it is very difficult to directly apply property 4 to test a function
to see if it qualifies as a valid autocorrelation function. However, we soon will introduce an 
equivalent frequency domain function called power spectral density, which furnishes an easy 
test of validity. 

Many of the input-output relations derived in the previous section take a surpris- 
ingly simple form in the case of WSS random sequences and LSI systems described via 
convolution. For example, starting with 


$$Y[n] = \sum_{k=-\infty}^{+\infty} h[n-k]X[k],$$

we obtain

$$\begin{aligned}
R_{XY}[m,n] &= E\{X[m]Y^*[n]\} \\
&= \sum_{k=-\infty}^{+\infty} h^*[n-k]E\{X[m]X^*[k]\} \\
&= \sum_{k=-\infty}^{+\infty} h^*[n-k]R_{XX}[m-k] \\
&= \sum_{l=-\infty}^{+\infty} h^*[-l]R_{XX}[(m-n)-l], \quad \text{with } l \triangleq k-n,
\end{aligned}$$

if the input random sequence X[n] is WSS. So, the output cross-correlation function
$R_{XY}[m,n]$ is shift-invariant, and we can make use of the one-parameter cross-correlation
function $R_{XY}[m] \triangleq R_{XY}[m,0]$ to write

$$R_{XY}[m] = \sum_{l=-\infty}^{+\infty} h^*[-l]R_{XX}[m-l] = h^*[-m] * R_{XX}[m],$$
in terms of the one-parameter autocorrelation function $R_{XX}[m]$. Likewise, recalling that

$$\begin{aligned}
R_{YY}[n+m, n] &\triangleq E\{Y[n+m]Y^*[n]\} \\
&= \sum_{k=-\infty}^{+\infty} h[k]E\{X[n+m-k]Y^*[n]\} \\
&= \sum_{k=-\infty}^{+\infty} h[k]R_{XY}[m-k] \\
&= h[m] * R_{XY}[m],
\end{aligned}$$

we see that the autocorrelation function of the output is shift-invariant, and so making use
of the one-parameter autocorrelation function $R_{YY}[m] \triangleq R_{YY}[m,0]$, we have

$$R_{YY}[m] = h[m] * R_{XY}[m].$$

Combining both equations, we get

$$\begin{aligned}
R_{YY}[m] &= h[m] * h^*[-m] * R_{XX}[m] \\
&= (h[m] * h^*[-m]) * R_{XX}[m]
\end{aligned} \tag{8.4-1}$$

$$= g[m] * R_{XX}[m], \quad \text{with } g[m] \triangleq h[m] * h^*[-m],$$

where g[m] is sometimes called the autocorrelation impulse response (AIR). Note that if
the input random sequence is WSS and independent, then its autocorrelation function
would be a positive constant times $\delta[m]$, so that taking this constant to be unity, we would
have the output autocorrelation function equal to g[m] itself. Therefore, g[m] must possess
all the properties of autocorrelation functions, that is, $g[l] = g^*[-l]$, $g[0] \ge |g[l]|$ for all l,
and positive semidefiniteness. The AIR g depends only on the impulse response h of the
LSI system; however, in the absence of other information, we cannot uniquely determine h
from g. In astronomy, crystallography, and other fields the problem of estimating h from
the AIR is an important problem known by various names including phase recovery and
deconvolution.


Example 8.4-1 
(impulse response) We cannot in general calculate the impulse response from the AIR.
To show this, first take the Fourier transform of g[m] to obtain $G(\omega) = H(\omega)H^*(\omega) =
|H(\omega)|^2$. Then note that $|H(\omega)| = \sqrt{G(\omega)}$. Thus the phase of $H(\omega)$ is lost in the AIR, but
the magnitude of $H(\omega)$ is preserved. Often there is some information available that can
narrow down or possibly pinpoint the phase, for example, the support of h[n] in an image
application, or causality for a time-based signal. For the interested reader, the literature
contains many articles on this subject; see for example [8-6].





Example 8.4-2 
(correlation function analysis of the edge detector using impulse response) In the edge 
detector of Example 8.3-2, the linear transformation was given as 








$$Y[n] = L\{X[n]\} \triangleq X[n] - X[n-1],$$

an LSI operation with impulse response $h[n] = \delta[n] - \delta[n-1]$, and input autocorrelation
function $R_{XX}[m] = a^{|m|}$, with |a| < 1. We can easily calculate the AIR as

$$\begin{aligned}
g[m] &= h[m] * h[-m] \\
&= (\delta[m] - \delta[m-1]) * (\delta[-m] - \delta[-m-1]) \\
&= (\delta[m] - \delta[m-1]) * (\delta[m] - \delta[m+1]) \\
&= \delta[m] - \delta[m-1] - \delta[m+1] + \delta[m] \\
&= 2\delta[m] - \delta[m-1] - \delta[m+1].
\end{aligned}$$

We then calculate the output autocorrelation function in this WSS case as

$$\begin{aligned}
R_{YY}[m] &= g[m] * R_{XX}[m] \\
&= (2\delta[m] - \delta[m-1] - \delta[m+1]) * a^{|m|} \\
&= 2a^{|m|} - a^{|m-1|} - a^{|m+1|}, \quad \text{for } -\infty < m < +\infty,
\end{aligned}$$

which agrees with the answer in Example 8.3-2, where the result was plotted for a = 0.7.
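The AIR calculation above can be mechanized for any real, finite-length, causal impulse response. The Python sketch below is illustrative only; it stores g[m] as a dictionary keyed by lag, and the restriction to real-valued h (so that conjugation can be dropped) is an assumption made for simplicity.

```python
def autocorrelation_impulse_response(h):
    """Return {lag m: g[m]} with g[m] = sum_k h[k] h[k - m] for a real, finite, causal h."""
    length = len(h)
    return {
        m: sum(h[k] * h[k - m] for k in range(length) if 0 <= k - m < length)
        for m in range(-(length - 1), length)
    }


def output_autocorrelation(g, r_xx, m):
    """R_YY[m] = sum_l g[l] R_XX[m - l] for a finite-support AIR g."""
    return sum(g_l * r_xx(m - l) for l, g_l in g.items())


a = 0.7
h = [1.0, -1.0]   # h[n] = delta[n] - delta[n - 1]
g = autocorrelation_impulse_response(h)
r_yy = [output_autocorrelation(g, lambda m: a ** abs(m), m) for m in range(-3, 4)]
print(g)
print(r_yy)
```

For the edge detector this yields g[0] = 2 and g[1] = g[-1] = -1, matching the hand calculation.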
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Power Spectral Density 


We define power spectral density (psd) as the FT (cf. Definition 8.2-3) of the one-parameter
discrete-time autocorrelation function of a WSS random sequence $X[n]$:

$$S_{XX}(\omega) \triangleq \sum_{m=-\infty}^{+\infty} R_{XX}[m]\exp(-j\omega m), \quad \text{for } -\pi \le \omega \le +\pi. \tag{8.4-2}$$


Now by taking the FT of Equation 8.4-1, we get the following important psd input/output
relation for an LSI system excited by a WSS random sequence:

$$S_{YY}(\omega) = |H(\omega)|^2 S_{XX}(\omega) = G(\omega)S_{XX}(\omega), \tag{8.4-3}$$

where the various frequency-domain quantities are discrete-time Fourier transforms.
Equation 8.4-3 is a central result in the theory of WSS random sequences in that it enables
the computation of the output psd directly from knowledge of the input psd and the transfer
function magnitude. By using the IFT, we can calculate the autocorrelation function as

$$R_{XX}[m] = \text{IFT}\{S_{XX}(\omega)\} = \frac{1}{2\pi}\int_{-\pi}^{+\pi} S_{XX}(\omega)e^{+j\omega m}\,d\omega,$$

so that knowledge of the psd implies knowledge of the autocorrelation function.
As to the name power spectral density, note that $R_{XX}[0] = E\{|X[n]|^2\}$ is the ensemble
average power in $X[n]$ and so by the above relation, we see that

$$E\{|X[n]|^2\} = R_{XX}[0] = \frac{1}{2\pi}\int_{-\pi}^{+\pi} S_{XX}(\omega)\,d\omega,$$

so that the integral average of the psd over its frequency range $[-\pi, +\pi]$ is indeed average
power. To pursue this further, we consider a WSS random sequence $X[n]$ input to an LSI
system consisting of a narrowband filter $H(\omega)$, with very small bandwidth $2\epsilon$, centered at
frequency $\omega_o$, where $|\omega_o| < \pi$, and with unity passband gain. Writing $S_{XX}(\omega)$ for the input
psd, we have for the output ensemble average power, approximately

$$R_{YY}[0] = \frac{1}{2\pi}\int_{\omega_o-\epsilon}^{\omega_o+\epsilon} S_{XX}(\omega)\,d\omega \simeq S_{XX}(\omega_o)\frac{\epsilon}{\pi},$$

thus showing that $S_{XX}(\omega)$ can be interpreted as a density function in frequency for ensemble
average power.
Some important properties of the psd are given below: 


1. The function $S_{XX}(\omega)$ is real valued since $R_{XX}[m]$ is conjugate-symmetric.

2. If $X[n]$ is a real-valued random sequence, then $S_{XX}(\omega)$ is an even function of $\omega$.

3. The function $S_{XX}(\omega) \ge 0$ for every $\omega$, whether $X[n]$ is real- or complex-valued.

4. If $R_{XX}[m] = 0$ for all $|m| > N$ for some finite integer $N > 0$ (i.e., it has finite
support), then $S_{XX}(\omega)$ is an analytic function in $\omega$. This means that $S_{XX}(\omega)$ can
be represented in a Taylor series given its value and that of all its derivatives at a
single point $\omega_o$.
Since $S_{XX}(\omega)$ is the Fourier transform of an autocorrelation sequence, it is periodic with
period $2\pi$. This is why the inverse Fourier transform, which recovers the autocorrelation
function, only integrates over $[-\pi, +\pi]$, the primary period. We also define the Fourier
transform of the cross-correlation function of two jointly WSS random sequences:

$$S_{XY}(\omega) \triangleq \sum_{m=-\infty}^{+\infty} R_{XY}[m]\exp(-j\omega m), \quad \text{for } -\pi \le \omega \le +\pi,$$

called the cross-power spectral density between random sequences $X$ and $Y$. In general, this
cross-power spectral density can be complex, negative, and lacking in symmetry. Its main
use is as an intermediate step in calculation of psd's.


Interpretation of the psd


From its name, we expect that the psd should be related to some kind of average of the
magnitude-square of the Fourier transform of the random signal. Now since a WSS random
signal $X[n]$ has constant average power $R_{XX}[0]$ for all time, we cannot define its FT;
however, we can define the transform quantity

$$X_N(\omega) \triangleq \text{FT}\{w_N[n]X[n]\}$$

with aid of the rectangular window function

$$w_N[n] \triangleq \begin{cases} 1, & |n| \le N, \\ 0, & \text{else.} \end{cases}$$

Then, taking the expectation of the magnitude square $|X_N(\omega)|^2$, and dividing by $2N + 1$,
we get

$$\begin{aligned}
\frac{1}{2N+1}E\{|X_N(\omega)|^2\} &= \frac{1}{2N+1}E\left\{\sum_{k=-N}^{+N}\sum_{l=-N}^{+N} X[k]X^*[l]\exp(-j\omega k)\exp(+j\omega l)\right\} \\
&= \frac{1}{2N+1}\sum_{k=-N}^{+N}\sum_{l=-N}^{+N} E\{X[k]X^*[l]\}\exp(-j\omega k)\exp(+j\omega l) \\
&= \frac{1}{2N+1}\sum_{k=-N}^{+N}\sum_{l=-N}^{+N} R_{XX}[k-l]\exp[-j\omega(k-l)] \\
&= \sum_{m=-2N}^{+2N} R_{XX}[m]\left(1 - \frac{|m|}{2N+1}\right)\exp(-j\omega m),
\end{aligned}$$

where the last line comes from the fact that $R_{XX}[k-l]$ is constant along diagonals $k - l = m$
of the $(2N+1) \times (2N+1)$ point square in the $(k, l)$ plane.

Now as $N \to \infty$, the triangular function $\left(1 - \frac{|m|}{2N+1}\right)$ has less and less effect if $|R_{XX}[m]| \to
0$ as $|m| \to \infty$, as it must for the Fourier transform, that is, $S_{XX}(\omega)$, to exist. In fact,
if we assume that $|R_{XX}[m]|$ decays fast enough to satisfy $\sum_{m=-\infty}^{+\infty} |m||R_{XX}[m]| < \infty$, then
we have

$$S_{XX}(\omega) = \lim_{N\to\infty}\frac{1}{2N+1}E\{|X_N(\omega)|^2\}. \tag{8.4-4}$$

In words we have that the ensemble average of the power at frequency $\omega$ in the windowed
random sequence is given by the psd $S_{XX}(\omega)$. Note that we have said nothing about the
variance of the random variable $\frac{1}{2N+1}|X_N(\omega)|^2$, but just that its mean value converges
to the psd. In the study of spectral estimation (cf. Section 11.6), it is shown that the
variance does not get small as $N$ gets large, so that $\frac{1}{2N+1}|X_N(\omega)|^2$ cannot be considered a
good estimate of the psd without first doing some averaging. In the language of statistics
(Chapter 6) we say that $(2N+1)^{-1}|X_N(\omega)|^2$ is not a consistent estimator for $S_{XX}(\omega)$.
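The convergence of the mean in Equation 8.4-4 can be checked numerically. The sketch below assumes the illustrative autocorrelation $R_{XX}[m] = a^{|m|}$, whose psd is $(1-a^2)/(1-2a\cos\omega+a^2)$, and evaluates the triangularly weighted sum for increasing $N$. The function names and the chosen values of $a$ and $\omega$ are assumptions for this example; the code says nothing about the variance of the periodogram.

```python
import math


def psd_exact(w, a):
    """S_XX(w) for the assumed autocorrelation R_XX[m] = a**|m|."""
    return (1 - a * a) / (1 - 2 * a * math.cos(w) + a * a)


def expected_periodogram(w, a, N):
    """sum_{m=-2N}^{2N} R_XX[m] (1 - |m|/(2N+1)) exp(-jwm).

    The sum is real because R_XX is real and even, so only cosines remain.
    """
    total = 0.0
    for m in range(-2 * N, 2 * N + 1):
        weight = 1 - abs(m) / (2 * N + 1)
        total += a ** abs(m) * weight * math.cos(w * m)
    return total


a, w = 0.7, 1.0
errors = [abs(expected_periodogram(w, a, N) - psd_exact(w, a)) for N in (16, 64, 256)]
print(errors)  # the bias shrinks roughly like 1/(2N+1)
```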








Example 8.4-3 
Here is a MATLAB m-file to compute the psd’s of the random sequences with memory in 
Example 8.1-16 for p = 0.8, 0.5, and 0.2. 





function [psd1,psd2,psd3]=psdmarkov2(N,p1,p2,p3)
mc1=0*ones(1,N);
mc2=0*ones(1,N);
mc3=0*ones(1,N);
for i=1:N
    % The (-1)^(i-1) factor shifts the spectrum to yield an even function
    % of frequency. Otherwise the highest frequency components appear at pi
    % and the lowest at 2*pi.
    mc1(i)=0.25*(((-1)*(2*p1-1))^(i-1));
    mc2(i)=0.25*(((-1)*(2*p2-1))^(i-1));
    mc3(i)=0.25*(((-1)*(2*p3-1))^(i-1));
end
x=linspace(-pi,pi,N);
psd1=abs(fft(mc1));
psd2=abs(fft(mc2));
psd3=abs(fft(mc3));
plot(x,psd1,x,psd2,x,psd3)
title('Power spectral density (psd) of random sequences with memory')
xlabel('radian frequency')
ylabel('psd value')
end


See the three plots in Figure 8.4-1. 


[Figure 8.4-1 Power spectral densities of three stationary random sequences with memory. Horizontal axis: radian frequency ($\omega$).]


Example 8.4-4
A stationary random sequence $X[n]$ has power spectral density $S_{XX}(\omega) = N_0 w(3\omega/4\pi)$,
where the rectangular window function $w$ is given as

$$w(x) \triangleq \begin{cases} 1, & |x| < 1/2, \\ 0, & \text{else.} \end{cases}$$

It is desired to produce an output random sequence $Y[n]$ with the psd $S_{YY}(\omega) = N_0 w(\omega/\pi)$.
An LSI system (not necessarily causal) with impulse response $h[n]$ is proposed. Which of
the following impulse responses should be used? (Note that $\mathrm{sinc}(x) \triangleq \sin(\pi x)/\pi x$.)

(a) $2\,\mathrm{sinc}(n/2)$,
(b) $\frac{1}{2}\,\mathrm{sinc}((n-10)/2)$,
(c) $1.5e^{-|n|}u[n]$,
(d) $u[n+2] - u[n-2]$,
(e) $(1 - |n|)w(n/2)$.


Solution Clearly what is needed is an $H(\omega)$ with transfer-function magnitude $|H(\omega)| =
w(\omega/\pi)$. Choices (c) through (e) are ruled out immediately because their Fourier transforms
do not have constant magnitude inside any frequency band. Since the IFT of $w(\omega/\pi)$ is
$\frac{1}{2}\,\mathrm{sinc}(n/2)$, we choose (b) since its 10-sample delay does not affect the magnitude $|H(\omega)|$.





A useful summary of input/output relations for random sequences is presented in 
Table 8.4-1. 
Table 8.4-1 Input/Output Relations for WSS Sequences and Linear Systems

Random Sequence:                                     Output Mean:
  $Y[n] = h[n] * X[n]$                                 $\mu_Y = H(0)\mu_X$

Crosscorrelations:                                   Cross-Power Spectral Densities:
  $R_{XY}[m] = R_{XX}[m] * h^*[-m]$                    $S_{XY}(\omega) = S_{XX}(\omega)H^*(\omega)$
  $R_{YX}[m] = h[m] * R_{XX}[m]$                       $S_{YX}(\omega) = H(\omega)S_{XX}(\omega)$
  $R_{YY}[m] = R_{YX}[m] * h^*[-m]$                    $S_{YY}(\omega) = S_{YX}(\omega)H^*(\omega)$

Autocorrelation:                                     Power Spectral Density:
  $R_{YY}[m] = h[m] * h^*[-m] * R_{XX}[m]$             $S_{YY}(\omega) = |H(\omega)|^2 S_{XX}(\omega)$
  $\phantom{R_{YY}[m]} = g[m] * R_{XX}[m]$             $\phantom{S_{YY}(\omega)} = G(\omega)S_{XX}(\omega)$

Output Power and Variance:
  $E\{|Y[n]|^2\} = R_{YY}[0] = \frac{1}{2\pi}\int_{-\pi}^{+\pi} |H(\omega)|^2 S_{XX}(\omega)\,d\omega$
  $\sigma_Y^2 = R_{YY}[0] - |\mu_Y|^2$


Synthesis of Random Sequences and Discrete-Time Simulation 


Here we consider the problem of finding the appropriate transfer function $H(\omega)$ to generate
a random sequence with a specified psd or correlation function. Consider Equation 8.2-2,
repeated here for convenience:

$$y[n] = \sum_{k=1}^{M} a_k y[n-k] + \sum_{k=0}^{N} b_k x[n-k], \tag{8.4-5}$$

where the coefficients are real valued. The transfer function $H(\omega)$ is given by

$$H(\omega) = \frac{Y(\omega)}{X(\omega)} = \frac{B(\omega)}{A(\omega)},$$

where $B(\omega) \triangleq \sum_{k=0}^{N} b_k e^{-j\omega k}$ and $A(\omega) \triangleq 1 - \sum_{k=1}^{M} a_k e^{-j\omega k}$. When driven by a white-noise
sequence, $W[n]$, with power $E\{|W[n]|^2\} = \sigma_W^2$, the output psd, $S_{YY}(\omega)$, is given by

$$S_{YY}(\omega) = \frac{|B(\omega)|^2}{|A(\omega)|^2}\sigma_W^2 = \frac{B(\omega)B^*(\omega)}{A(\omega)A^*(\omega)}\sigma_W^2.$$

Now, recalling that $B(z) \triangleq B(\omega)$ at $z = e^{j\omega}$ and similarly for $A(z)$, $H(z)$, that $e^{-j\omega} = z^{-1}$,
and that $B^*(e^{j\omega}) = B(e^{-j\omega})$ for an LCCDE with real coefficients,† we obtain

$$S_{YY}(z) = \frac{B(z)B(z^{-1})}{A(z)A(z^{-1})}\sigma_W^2 = H(z)H(z^{-1})\sigma_W^2, \tag{8.4-7}$$

where up to this point we have confined $z$ to the unit circle. For the purpose of further
analysis, it is of interest to extend Equation 8.4-7 to the whole $z$-plane.

†Only when, as here, the impulse response coefficients are real valued. This is true here since the
numerator and denominator coefficients are real numbers.

This last step is called analytic continuation and simply amounts to finding a rational
function of $z$ which agrees with the given psd information on the unit circle $z = e^{j\omega}$.

Given any rational $S_{XX}(z)$, that is, one with a finite number of poles and zeros in the
finite $z$-plane, one can find such a spectral factorization as Equation 8.4-7 by defining $H(z)$
to have all the poles and zeros that are inside the unit circle, $\{|z| < 1\}$, and then $H(z^{-1})$
will necessarily have all the poles and zeros outside the unit circle, $\{|z| > 1\}$.


Example 8.4-5 


Consider the psd

$$S_{XX}(\omega) = \frac{\sigma_W^2}{1 - 2\rho\cos\omega + \rho^2}, \quad \text{with } |\rho| < 1.$$

We want to first extend $S_{XX}(\omega)$ to all of the $z$-plane. Now $\cos\omega = \frac{1}{2}(e^{+j\omega} + e^{-j\omega})$, which
can be extended as $\frac{1}{2}(z + z^{-1})$ and satisfies the symmetry condition $S_{XX}(z) = S_{XX}(z^{-1})$
of a real-valued random sequence. Then

$$\begin{aligned}
S_{XX}(z) &= \frac{\sigma_W^2}{1 - \rho(z + z^{-1}) + \rho^2}, \quad \text{for } |\rho| < |z| < \frac{1}{|\rho|} \\
&= \frac{\sigma_W^2}{(1 - \rho z)(1 - \rho z^{-1})} \\
&= \sigma_W^2 H(z)H(z^{-1})
\end{aligned}$$

with $H(z) = \dfrac{1}{1 - \rho z^{-1}}$ for region of convergence $|\rho| < |z|$.

Since $|\rho| < 1$, the region of convergence (ROC) includes the unit circle and so $H$ is
both stable and causal. Indeed the system with $h[n] = \rho^n u[n]$ will yield $S_{XX}(\omega)$ from an
independent sequence.
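The factorization of Example 8.4-5 can be sanity-checked on the unit circle. The sketch below evaluates $\sigma_W^2 H(z)H(z^{-1})$ at $z = e^{j\omega}$ and compares it with the given psd; the function names and test values of $\rho$ and $\sigma_W^2$ are assumptions for illustration.

```python
import cmath
import math


def psd_from_factors(w, rho, var_w):
    """Evaluate var_w * H(z) * H(1/z) at z = exp(jw), H(z) = 1/(1 - rho/z)."""
    z = cmath.exp(1j * w)
    h_causal = 1 / (1 - rho / z)        # H(z), pole at z = rho (inside)
    h_anticausal = 1 / (1 - rho * z)    # H(1/z), pole at z = 1/rho (outside)
    return (var_w * h_causal * h_anticausal).real


def psd_target(w, rho, var_w):
    """The psd var_w / (1 - 2 rho cos w + rho^2) given in the example."""
    return var_w / (1 - 2 * rho * math.cos(w) + rho ** 2)


def truncated_transfer_function(w, rho, terms=200):
    """Partial sum of sum_n rho^n exp(-jwn), the FT of h[n] = rho^n u[n]."""
    return sum(rho ** n * cmath.exp(-1j * w * n) for n in range(terms))
```

On the unit circle $H(z^{-1})$ is the complex conjugate of $H(z)$, so the product is real and nonnegative, as a psd must be.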


If a zero occurs on the unit circle, then it must be of even order, since otherwise one
can easily show that $S_{XX}(e^{j\omega})$ must go through zero and hence be negative in its vicinity.
Thus, we can assign half the zeros to $H(z)$ and the other half to $H(z^{-1})$. Since $H(z)$ contains
only poles inside the unit circle, it will be BIBO stable [8-5]. Except in the case of a zero
on the unit circle, its inverse will also be stable. The other factor $H(z^{-1})$ has all its poles
outside the unit circle, so it is stable in the anticausal sense. Denoting the largest pole
magnitude inside the unit circle by $p_{\max}$, we thus have that $S_{XX}(z)$ is analytic, that is, free
of singularities in the annular region of convergence $\{p_{\max} < |z| < 1/p_{\max}\}$.

Following the above procedures, we can obtain the system function $H(z)$ that, when
driven by a white noise $W[n]$, will generate a random sequence $X[n]$ with specified psd
$S_{XX}(\omega)$. This can be the basis for a discrete-time simulation on a computer. The white
random sequence $W[n]$ is easily obtained by using the computer's random number generator.
Then one specifies appropriate initial conditions and proceeds to recursively calculate $X[n]$
using the LCCDE of the system function $H(z)$.

To achieve a Gaussian distribution for X, one could transform the output of the random 
number generator to achieve a Gaussian distribution for W, which would carry across to X. 
An approximate method that is often used is to average six to ten calls to the random
number generator to obtain an approximate Gaussian distribution for $W$ via the Central
Limit theorem. When simulating a non-Gaussian random variable, the distributions of $X$
and $W$ are not the same, so the preceding method will not work. One possibility is to use
the LCCDE to generate samples of W [n] from some real data and then use the resulting 
distribution for W [n] in the simulation. 


Example 8.4-6 
(matching given correlation values) In order to simulate a zero-mean random sequence with
average power $R_{XX}[0] = \sigma^2$ and nearest neighbor correlation $R_{XX}[1] = \rho\sigma^2$, we want to
find the parameters of a first-order stochastic difference equation to achieve these values.
Thus consider





$$X[n] = aX[n-1] + bW[n], \tag{8.4-8}$$

where $W[n]$ is a zero-mean white-noise source with unit power. Computing the impulse
response, we get

$$h[n] = ba^n u[n]$$

and the corresponding system function

$$H(z) = \frac{b}{1 - az^{-1}}.$$


Since the mean is zero, we calculate the covariance of the output $X[n]$ of Equation 8.4-8:

$$\begin{aligned}
K_{XX}[m] &= h[m] * h[-m] * K_{WW}[m] \\
&= h[m] * h[-m] \\
&= b^2 (a^m u[m]) * (a^{-m} u[-m]) \\
&= b^2 \sum_{k=-\infty}^{+\infty} a^k u[k]\, a^{m+k} u[m+k] \\
&= b^2 a^m \sum_{k=\max(0,-m)}^{+\infty} a^{2k} \\
&= \frac{b^2}{1 - a^2}\, a^{|m|}, \quad -\infty < m < +\infty.
\end{aligned}$$

From the specifications at $m = 0$ and $m = 1$, we need

$$K_{XX}[0] = \sigma^2 = b^2/(1 - a^2),$$
$$K_{XX}[1] = \rho\sigma^2 = ab^2/(1 - a^2).$$

Thus,

$$a = \rho \quad \text{and} \quad b^2 = \sigma^2(1 - \rho^2).$$


To compute the resulting psd, we use the input/output relation of Equation 8.4-3 to get


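$$\begin{aligned}
S_{XX}(\omega) &= \frac{b^2}{|1 - ae^{-j\omega}|^2} \\
&= \frac{\sigma^2(1 - \rho^2)}{1 - 2\rho\cos\omega + \rho^2}.
\end{aligned}$$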
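A simulation along the lines of Example 8.4-6 might look like the following sketch. It designs $a$ and $b$ from the target values, runs the recursion of Equation 8.4-8 with Gaussian white noise from Python's standard `random` module, and estimates the lag-0 and lag-1 correlations by time averaging. The function names, the seed, the sample size, and the choice to draw the initial state from the steady-state distribution are all assumptions of this illustration.

```python
import math
import random


def design_ar1(sigma2, rho):
    """Return (a, b) with a = rho and b^2 = sigma2 * (1 - rho^2)."""
    return rho, math.sqrt(sigma2 * (1 - rho ** 2))


def simulate_ar1(a, b, n_samples, seed=0):
    """Generate X[n] = a X[n-1] + b W[n] with unit-power Gaussian W[n]."""
    rng = random.Random(seed)
    # Assumed initial condition: draw X from the steady-state N(0, b^2/(1-a^2)).
    x = rng.gauss(0.0, b / math.sqrt(1 - a * a))
    samples = []
    for _ in range(n_samples):
        x = a * x + b * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def sample_correlation(x, lag):
    """Time-average estimate of R_XX[lag] for a zero-mean sequence."""
    n = len(x) - lag
    return sum(x[i + lag] * x[i] for i in range(n)) / n
```

For example, with `a, b = design_ar1(4.0, 0.6)` and a run of 50,000 samples, the estimates of $R_{XX}[0]$ and $R_{XX}[1]$ should land near 4.0 and 2.4, up to sampling fluctuation.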


Example 8.4-7 
(decimation and interpolation) Let X[n] be a WSS random sequence. We consider what 
happens to its stationarity and psd when we subject it to decimation or interpolation as 
occur in many signal processing systems. 





Decimation 


Set $Y[n] \triangleq X[2n]$, called decimation by the factor 2, thus throwing away every odd indexed
sample of $X[n]$ (Figure 8.4-2). We easily calculate the mean function as $\mu_Y[n] \triangleq E\{Y[n]\} =
E\{X[2n]\} = \mu_X[2n] = \mu_X$, a constant. For the correlation,


Ryy [n+ m,n] = E{X[2n + 2m]X*[2n]} 
= Rxx[2n + 2m, 2n] 
= Rxx [2m], 


thus showing that the WSS property of the original random sequence X[n] is preserved in 
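the decimated random sequence. The psd of $Y[n]$ can be computed as

$$\begin{aligned}
S_{YY}(\omega) &\triangleq \sum_{m=-\infty}^{+\infty} R_{YY}[m]\exp[-j\omega m] \\
&= \sum_{m=-\infty}^{+\infty} R_{XX}[2m]\exp[-j\omega m] \\
&= \sum_{m\ \text{even}} R_{XX}[m]\exp\left[-j\frac{\omega}{2}m\right] = \sum_{m\ \text{even}} R_{XX}[m]\exp\left[-j\frac{\omega}{2}m\right](-1)^m.
\end{aligned}$$

Now, define $A_e \triangleq S_{YY}(\omega)$, and $A_o \triangleq \sum_{m\ \text{odd}} R_{XX}[m]\exp[-j\frac{\omega}{2}m]$. Then, clearly $A_e + A_o =
S_{XX}(\frac{\omega}{2})$ and $A_e - A_o = S_{XX}(\frac{\omega - 2\pi}{2})$, so that

$$S_{YY}(\omega) = \frac{1}{2}\left[S_{XX}\left(\frac{\omega}{2}\right) + S_{XX}\left(\frac{\omega - 2\pi}{2}\right)\right],$$

which displays an aliasing [8-5] of higher-frequency terms.

[Figure 8.4-2 In decimation every other value of $X[n]$ is discarded.]

[Figure 8.4-3 In interpolation, the expansion step inserts zeros between adjacent values of the $X[n]$
sequence, to get the expanded sequence $X_e[n]$.]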
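The aliasing formula for decimation can be verified for a case where both sides have closed forms. The sketch assumes $R_{XX}[m] = a^{|m|}$, so that $R_{YY}[m] = R_{XX}[2m] = (a^2)^{|m|}$; this choice of autocorrelation and the function names are assumptions for illustration only.

```python
import math


def psd_geometric(w, a):
    """psd of a sequence with autocorrelation a**|m|, for |a| < 1."""
    return (1 - a * a) / (1 - 2 * a * math.cos(w) + a * a)


def psd_decimated_via_aliasing(w, a):
    """(1/2) [S_XX(w/2) + S_XX((w - 2 pi)/2)] from the decimation result."""
    return 0.5 * (psd_geometric(w / 2, a) + psd_geometric((w - 2 * math.pi) / 2, a))


def psd_decimated_direct(w, a):
    """psd of Y[n] = X[2n] computed from R_YY[m] = (a^2)**|m| directly."""
    return psd_geometric(w, a * a)
```

The two routes agree at every frequency, which confirms the factor of one half and the shift by $2\pi$ inside the second term.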








Interpolation 


For interpolation by the factor 2, we do the opposite of decimation. First we perform an
expansion by setting

$$X_e[n] \triangleq \begin{cases} X\left[\frac{n}{2}\right], & n \text{ even}, \\ 0, & n \text{ odd}. \end{cases}$$

The resulting expanded random sequence is clearly nonstationary, because of the zero
insertions. See Figure 8.4-3. Formally the psd of $X_e[n]$ doesn't exist since the psd is defined
only for WSS sequences (Figure 8.4-4). We encounter such problems with a broad class
of random sequences and processes† classified as being cyclostationary (cf. Section 9.6) to
which $X_e[n]$ belongs. It is easy to convert such sequences to WSS by randomizing their
start times and then averaging over the start time (Example 9.6-1). However, here we
instead compute the power spectral density using Equation 8.4-4, which is permissible for
cyclostationary waveforms. Thus we write

$$E\{|X_N^{(e)}(\omega)|^2\} = E\left\{\left|\sum_{n=-N}^{N} X_e[n]e^{-j\omega n}\right|^2\right\}$$

and take the limit of $E\{|X_N^{(e)}(\omega)|^2\}/(2N+1)$ as $N \to \infty$. This quantity can be interpreted
as the psd, $S_{X_eX_e}(\omega)$, of the random sequence $X_e[n]$. If the algebra is carried out and we
assume that $R_{XX}[m]$ is absolutely summable, we find that $S_{X_eX_e}(\omega) = \frac{1}{2}S_{XX}(2\omega)$. For
further discussion of the expansion step, see Problems 8.58 and 8.59.

†Random processes are continuous-time random waveforms to be discussed in Chapter 9.

[Figure 8.4-4 (a) The original psd of $X[n]$; (b) the psd of $X_e[n]$ (not drawn to scale). Note the "leakage"
of power density from the secondary periods into the primary period. An ideal lowpass filter with support
$[-\pi/2, +\pi/2]$ will eliminate the contribution from the secondary periods.]

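Next we put $X_e[n]$, sometimes called an upsampled version of $X[n]$, through an ideal
lowpass filter with bandwidth $[-\frac{\pi}{2}, +\frac{\pi}{2}]$ and gain of 2, to produce the "ideal" interpolated
output $Y[n]$ as

$$Y[n] \triangleq h[n] * X_e[n].$$

The impulse response of such a filter is

$$h[n] = \frac{\sin(\pi n/2)}{(\pi n/2)}.$$

Thus,

$$\begin{aligned}
Y[n] &= \sum_{k=-\infty}^{+\infty} X_e[k]\frac{\sin(n-k)\pi/2}{(n-k)\pi/2} \\
&= \sum_{k=-\infty}^{+\infty} X[k]\frac{\sin(n-2k)\pi/2}{(n-2k)\pi/2}.
\end{aligned}$$

First we calculate the mean function of $Y[n]$,

$$\begin{aligned}
\mu_Y[n] &\triangleq E\{Y[n]\} \\
&= E\left\{\sum_{k=-\infty}^{+\infty} X[k]\frac{\sin(n-2k)\pi/2}{(n-2k)\pi/2}\right\} \\
&= \sum_{k=-\infty}^{+\infty} \mu_X[k]\frac{\sin(n-2k)\pi/2}{(n-2k)\pi/2} \\
&= \mu_X \sum_{k=-\infty}^{+\infty} \frac{\sin(n-2k)\pi/2}{(n-2k)\pi/2},
\end{aligned}$$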
the last step being allowed since $\mu_X$ is a constant. Now sampling theory can be used to show
that the infinite sum is 1, so that $\mu_Y[n] = \mu_X$. To see this we write the sampling theorem
representation for an arbitrary bandlimited function $g(t)$ and sampling period $T = 2$ as
in [8-5]:

$$g(t) = \sum_{k=-\infty}^{+\infty} g(2k)\frac{\sin(t-2k)\pi/2}{(t-2k)\pi/2}. \tag{8.4-9}$$

Then we simply choose $t = n$ for the bandlimited function $g(t) = 1$ with zero bandwidth to
see that

$$1 = \sum_{k=-\infty}^{+\infty} \frac{\sin(n-2k)\pi/2}{(n-2k)\pi/2}.$$
For future reference we define $h(t) \triangleq \dfrac{\sin \pi t/2}{\pi t/2}$. To find the correlation function, we
proceed to calculate

$$\begin{aligned}
R_{YY}[n+m, n] &= E\{Y[n+m]Y^*[n]\} \\
&= \sum_{k_1,k_2=-\infty}^{+\infty} E\{X[k_1]X^*[k_2]\}h[n+m-2k_1]h[n-2k_2] \\
&= \sum_{l_1\ \text{even}} R_{XX}[l_1] \sum_{l_2\ \text{even}} h[n+m-l_1-l_2]h[n+l_1-l_2] \\
&\quad + \sum_{l_1\ \text{odd}} R_{XX}[l_1] \sum_{l_2\ \text{odd}} h[n+m-l_1-l_2]h[n+l_1-l_2]
\end{aligned}$$

with $l_1 \triangleq k_1 - k_2$ and $l_2 \triangleq k_1 + k_2$ and $l_2 \pm l_1$ even. We can evaluate the sums

$$\sum_{l_2\ \text{even or odd}} h[n+m-l_1-l_2]h[n+l_1-l_2]$$

by letting $g(t) = h(t)$ in Equation 8.4-9 and allowing $t$ to take the value $t = m$. We find
that each sum, both the even and odd, equals $h[m - 2l_1]$. Thus,

$$R_{YY}[m+n, n] = R_{YY}[m] = \sum_{l_1} R_{XX}[l_1]h[m-2l_1].$$

We thus see that $Y[n]$ is WSS, that $R_{YY}[m]$ interpolates $R_{XX}[m]$, that is, since $h[2m - 2l_1] = \delta[m - l_1]$,

$$\begin{aligned}
R_{YY}[2m] &= \sum_{l_1} R_{XX}[l_1]h[2m-2l_1] \\
&= R_{XX}[m],
\end{aligned}$$
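and calculating the psd

$$\begin{aligned}
S_{YY}(\omega) &= \sum_{m} R_{YY}[m]e^{-j\omega m} \\
&= \sum_{m}\sum_{l_1} R_{XX}[l_1]h[m-2l_1]e^{-j\omega m} \\
&= \sum_{l_1} R_{XX}[l_1]\sum_{m} h[m-2l_1]e^{-j\omega m} \\
&= \sum_{l_1} R_{XX}[l_1]H(\omega)e^{-j2\omega l_1} \\
&= H(\omega)S_{XX}(2\omega) \\
&= \begin{cases} 2S_{XX}(2\omega), & |\omega| \le \pi/2, \\ 0, & \pi/2 < |\omega| \le \pi. \end{cases}
\end{aligned}$$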




8.5 MARKOV RANDOM SEQUENCES 


We have already encountered some examples of Markov random sequences. Such sequences 
were loosely said to have a memory and to possess a state. Here we make these concepts 
more precise. We start with a definition. 


Definition 8.5-1 (Markov random sequence)
(a) A continuous-valued Markov random sequence $X[n]$, defined for $n \ge 0$, satisfies the
conditional pdf expression

$$f_X(x_{n+k}|x_n, x_{n-1}, \ldots, x_0) = f_X(x_{n+k}|x_n)$$

for all $x_0, \ldots, x_n, x_{n+k}$, for all $n > 0$, and for all integers $k \ge 1$.
(b) A discrete-valued Markov random sequence $X[n]$, defined for $n \ge 0$, satisfies the
conditional PMF expression

$$P_X(x_{n+k}|x_n, \ldots, x_0) = P_X(x_{n+k}|x_n)$$

for all $x_0, \ldots, x_n, x_{n+k}$, for all $n > 0$, and for all $k \ge 1$. ■


It is sufficient for the above properties to hold for just k = 1, which is the so-called 
one-step case, as the general property can be built up from it. The discrete-valued Markov 
random sequence is also called a Markov chain and will be covered in the next section. Here 
we consider the continuous-valued case. 

To check the meaning and usefulness of the Markov concept, consider the general Nth-order
pdf $f_X(x_N, x_{N-1}, \ldots, x_0)$ of random sequence $X$, and repeatedly use conditioning to
obtain the chain rule of probability

$$f_X(x_0, x_1, \ldots, x_N) = f_X(x_0)f_X(x_1|x_0)f_X(x_2|x_1, x_0)\cdots f_X(x_N|x_{N-1}, \ldots, x_0). \tag{8.5-1}$$
Now substitute the basic one-step ($k = 1$) version of the Markov definition to obtain

$$\begin{aligned}
f_X(x_0, x_1, \ldots, x_N) &= f_X(x_0)f_X(x_1|x_0)f_X(x_2|x_1)\cdots f_X(x_N|x_{N-1}) \\
&= f_X(x_0)\prod_{k=1}^{N} f_X(x_k|x_{k-1}).
\end{aligned}$$


Next we present two examples of continuous-valued Markov random sequences which 
are Gaussian distributed. 


Example 8.5-1
(Gauss Markov random sequence) Let $X[n]$ be a random sequence defined for $n \ge 1$, with
initial pdf

$$f_X(x; 0) = N(0, \sigma_0^2)$$

for a given $\sigma_0 > 0$ and transition pdf

$$f_X(x_n|x_{n-1}; n, n-1) \sim N(\rho x_{n-1}, \sigma_W^2)$$

with $|\rho| < 1$ and $\sigma_W > 0$. We want to determine the unconditional density of $X[n]$ at an
arbitrary time $n \ge 1$ and proceed as follows.

In general, one would have to advance recursively from the initial density by performing 
the integrals (cf. Equation 2.6-84) 


$$f_X(x; n) = \int_{-\infty}^{+\infty} f_X(x|\xi; n, n-1)f_X(\xi; n-1)\,d\xi \tag{8.5-2}$$

for $n = 1, 2, 3$, and so forth. However, in this example we know that the unconditional
first-order density will be Gaussian because each of the pdf's in Equation 8.5-2 is Gaussian,
and the Gaussian density "reproduces itself" in this context; that is, the product of two
exponential functions is still exponential. Hence the pdf $f_X(x; n)$ is determined by its first
two moments. We first calculate the mean function


$$\begin{aligned}
\mu_X[n] &= E\{X[n]\} \\
&= E[E\{X[n]|X[n-1]\}] \\
&= E[\rho X[n-1]] \\
&= \rho\mu_X[n-1],
\end{aligned}$$

where the outer expectation is over the values of $X[n-1]$. We thus obtain the recursive
equation

$$\mu_X[n] = \rho\mu_X[n-1], \quad n \ge 1,$$

with prescribed initial condition $\mu_X[0] = 0$. Hence $\mu_X[n] = 0$ for all $n$.

We also need the variance function $\sigma_X^2[n]$, which in this case is just $E[X^2[n]]$ since the
mean is zero. Calculating, we obtain

$$\begin{aligned}
E\{X^2[n]\} &= E[E\{X^2[n]|X[n-1]\}] \\
&= E[\sigma_W^2 + \rho^2X^2[n-1]] \\
&= \sigma_W^2 + \rho^2E\{X^2[n-1]\}
\end{aligned}$$

or

$$\sigma_X^2[n] = \rho^2\sigma_X^2[n-1] + \sigma_W^2, \quad n \ge 1.$$

This is a first-order difference equation, which can be solved for $\sigma_X^2[n]$ given the condition
$\sigma_X^2[0] = \sigma_0^2$ supplied by the initial pdf. The solution then is

$$\begin{aligned}
\sigma_X^2[n] &= [1 + \rho^2 + \rho^4 + \cdots + \rho^{2(n-1)}]\sigma_W^2 + \rho^{2n}\sigma_0^2 \\
&\to \frac{1}{1 - \rho^2}\sigma_W^2 \quad \text{as } n \to \infty.
\end{aligned}$$

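The variance recursion of Example 8.5-1 and its closed-form solution can be compared directly. In the sketch below the function names and the numerical values of $\rho$, $\sigma_W^2$, and $\sigma_0^2$ are assumptions chosen for illustration.

```python
def variance_by_recursion(rho, var_w, var_0, n):
    """Iterate sigma_X^2[k] = rho^2 sigma_X^2[k-1] + sigma_W^2 from k = 1 to n."""
    variance = var_0
    for _ in range(n):
        variance = rho ** 2 * variance + var_w
    return variance


def variance_closed_form(rho, var_w, var_0, n):
    """[1 + rho^2 + ... + rho^(2(n-1))] sigma_W^2 + rho^(2n) sigma_0^2."""
    geometric_sum = sum(rho ** (2 * k) for k in range(n))
    return geometric_sum * var_w + rho ** (2 * n) * var_0


def variance_limit(rho, var_w):
    """Steady-state variance sigma_W^2 / (1 - rho^2), valid for |rho| < 1."""
    return var_w / (1 - rho ** 2)
```

Whatever the initial variance $\sigma_0^2$, the recursion forgets it at the geometric rate $\rho^{2n}$ and settles at the steady-state value.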
Example 8.5-2 
(Markov difference equation) Consider the difference equation 
$$X[n] = \rho X[n-1] + W[n],$$

where $W[n]$ is an independent random sequence (cf. Definition 8.1-2). Let $n > 0$; then

$$\begin{aligned}
f_X(x_n, x_{n-1}, \ldots, x_0) &= f_X(x_n|x_{n-1})f_X(x_{n-1}|x_{n-2})\cdots f_X(x_1|x_0)f_W(x_0) \\
&= \left(\prod_{k=1}^{n} f_W(x_k - \rho x_{k-1})\right)f_W(x_0),
\end{aligned}$$

where $x[n] = x_n$ and $w[n] = w_n$ are the sample function values taken on by the random
sequences X [n] and W [n], respectively. Clearly X[n] is a Markov random sequence. If W [n] 
is an independent and Gaussian random sequence, then this is just the case of Example 8.5-1 
above. Otherwise, the Markov sequence X [n] will be non-Gaussian. 


The Markov property can be generalized to cover higher-order dependence and higher- 
order difference equations, thus extending the direct dependence concept to more than one- 
sample distance. 


Definition 8.5-2 (Markov-p random sequence) Let the positive integer $p$ be called the
order of the Markov-p random sequence. A continuous-valued Markov-p random sequence
$X[n]$, defined for $n \ge 0$, satisfies the conditional pdf equations

$$f_X(x_{n+k}|x_n, x_{n-1}, \ldots, x_0) = f_X(x_{n+k}|x_n, x_{n-1}, \ldots, x_{n-p+1})$$

for all $k \ge 1$ and for all $n \ge p$. ■


Returning to look at Equation 8.5-1, we can see that as the Markov order $p$ increases, the
modeling error in approximating a general random sequence by a Markov-p random sequence
should decrease.
$$\begin{aligned}
f_X(x_0, x_1, \ldots, x_N) &= f_X(x_0)f_X(x_1|x_0)f_X(x_2|x_1, x_0)\cdots f_X(x_N|x_{N-1}, \ldots, x_0) \\
&\simeq f_X(x_0)f_X(x_1|x_0)f_X(x_2|x_1, x_0)\cdots f_X(x_p|x_{p-1}, \ldots, x_0) \\
&\quad \times \prod_{k=p+1}^{N} f_X(x_k|x_{k-1}, \ldots, x_{k-p}).
\end{aligned}$$


This approximation would be expected to hold for the usual case where the strongest 
dependence is on the nearby values, say X [n — 1] and X[n — 2], with the conditional depen- 
dence on far away values being generally negligible. When the Markov-p model is used in 
signal processing, one of the most important issues is determining an appropriate model 
order p so that statistics like the joint pdf’s (Equation 8.5-1) of the original data are 
adequately approximated by those of the Markov-p model. In Chapter 11 on applications in 
statistical signal processing, we will see that Markov-p random sequences are quite useful in 
modern spectral estimation. The celebrated Kalman filter for the recursive linear estimation 
of distorted signals in noise is based on the Markov models. 


ARMA Models 


A class of linear constant coefficient difference equation models is called ARMA, for
autoregressive moving average. Here the input is an independent random sequence $W[n]$ with
mean $\mu_W = 0$ and variance $\sigma_W^2 = 1$. The LCCDE model then takes the form

$$X[n] = \sum_{k=1}^{M} c_k X[n-k] + \sum_{k=0}^{L} d_k W[n-k].$$

If the model is BIBO stable and $-\infty < n < +\infty$, then a WSS output sequence results
with psd

$$S_{XX}(\omega) = \frac{\left|\sum_{k=0}^{L} d_k \exp(-j\omega k)\right|^2}{\left|1 - \sum_{k=1}^{M} c_k \exp(-j\omega k)\right|^2}.$$

The ARMA sequence is not Markov, but when $L = 0$, the sequence is Markov-M, and
the resulting model is called autoregressive (AR). On the other hand when $M = 0$, that is,
there are no feedback coefficients $c_k$, the equation becomes just

$$X[n] = \sum_{k=0}^{L} d_k W[n-k],$$

and the model is called moving average (MA). The MA model is often used to estimate the
time-average value over a data window, as shown in the next example.
Example 8.5-3
(running time average) Consider a sequence of independent random variables $W[n]$ on
$n \ge 1$. Denote their running time average as

$$\hat{\mu}_W[n] \triangleq \frac{1}{n}\sum_{k=1}^{n} W[k].$$

Since we can write $\hat{\mu}_W[n]$ equivalently as satisfying the time-varying AR equation,

$$\hat{\mu}_W[n] = \frac{n-1}{n}\hat{\mu}_W[n-1] + \frac{1}{n}W[n],$$

it follows from the joint independence of the input $W[n]$ that $\hat{\mu}_W[n]$ is a nonstationary
Markov random sequence.†
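The equivalence between the batch average and the time-varying AR recursion in Example 8.5-3 is easy to confirm. The function name and the sample values in this sketch are assumptions for illustration.

```python
def running_mean(samples):
    """Return the running averages via mu[n] = ((n-1)/n) mu[n-1] + (1/n) W[n].

    The samples are treated as W[1], W[2], ..., so n starts at 1.
    """
    mu = 0.0
    averages = []
    for n, w in enumerate(samples, start=1):
        mu = (n - 1) / n * mu + w / n
        averages.append(mu)
    return averages


data = [2.0, 4.0, 9.0, 1.0]
recursive = running_mean(data)
batch = [sum(data[:n]) / n for n in range(1, len(data) + 1)]
print(recursive)
```

The recursion needs only the previous average and the new sample, which is the state property that makes the sequence Markov.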





Markov Chains 


A Markov random sequence can take on either continuous or discrete values and then be 
represented either by probability density functions (pdf’s) or probability mass functions 
(PMFs) accordingly. In the discrete-valued case, we call the random sequence a Markov 
chain. Applications occur in buffer occupancy, computer networks, and discrete-time approx- 
imate models for the continuous-time Markov chains (cf. Chapter 9). 


Definition 8.5-3 (Markov chain) A discrete-time Markov chain is a random sequence
$X[n]$ whose Nth-order conditional PMFs satisfy

$$P_X(x[n]|x[n-1], \ldots, x[n-N]) = P_X(x[n]|x[n-1]) \tag{8.5-3}$$

for all $n$, for all values of $x[k]$, and for all integers $N > 1$. ■


The value of X[n] at time n is called “the state.” This is because this current value, 
that is, the value at time n, determines future conditional PMFs, independent of the past 
values taken on by X[n]. 

A practical case of great importance is when the range of values taken on by $X[n]$ is
finite, say $M$. The discrete range of $X[n]$, that is, the values that $X$ takes on, is sometimes
referred to as a set of labels. The usual choices for the label set are either the integers
$\{1, \ldots, M\}$ or $\{0, \ldots, M-1\}$. Such a Markov chain is said to have a finite state space, or is simply
a finite-state Markov chain. In this case, and when the random sequence is stationary, we
can represent the statistical transition information in a matrix $\mathbf{P}$ with entries

$$p_{ij} \triangleq P_{X[n]|X[n-1]}(j|i), \tag{8.5-4}$$

for $1 \le i, j \le M$. The matrix $\mathbf{P}$ is referred to as the state-transition matrix. Its defining
property is that it is a matrix with nonnegative entries, whose rows sum to 1. Usually, and
again without loss of generality, we can consider that the Markov chain starts at time index
$n = 0$. Then we must specify the set of initial probabilities of the states at $n = 0$, that is,
$P_X(i; 0)$, $1 \le i \le M$, which can be stored in the initial probability vector $\mathbf{p}[0]$, a row vector
with elements $(\mathbf{p}[0])_i = P_X(i; 0)$, $1 \le i \le M$.

†Note that the variance of $\hat{\mu}_W[n]$ decreases with $n$.

The following example re-introduces the useful concept of state-transition diagram, 
already seen in Example 8.1-15. 


Example 8.5-4
(two-state Markov chain) Let M = 2; then we can summarize transition probability infor- 
mation about a two-state Markov chain in Figure 8.5-1. The only addition needed is the set 
of initial probabilities, Px (1;0) and Px (2; 0). 





Possible questions might be: Given that we are in state 1 at time 4, what is the prob- 
ability we end up in state 2 at time 6? Or given a certain probability distribution over the 
two states at time 3, what is the probability distribution over the two states at time 7? 
Note that there are several ways or paths to go from one state at one time to another state 
several time units later. The answers to these questions thus will involve a summation over 
these mutually exclusive outcomes. 

Here we have $M = 2$, and the two-element probability row vector $\mathbf{p}[n] = (p_0[n], p_1[n])$.
Using the state-transition matrix, we then have

$$\begin{aligned}
\mathbf{p}[1] &= \mathbf{p}[0]\mathbf{P} \\
\mathbf{p}[2] &= \mathbf{p}[1]\mathbf{P} \\
&= \mathbf{p}[0]\mathbf{P}^2
\end{aligned}$$

or, in general,

$$\mathbf{p}[n] = \mathbf{p}[0]\mathbf{P}^n.$$

In a statistical steady state, if one exists, we would have

$$\mathbf{p}[\infty] = \mathbf{p}[\infty]\mathbf{P}, \quad \text{where } \mathbf{p}[\infty] \triangleq \lim_{n\to\infty}\mathbf{p}[n].$$

Writing $\mathbf{p} = \mathbf{p}[\infty]$, we have $\mathbf{p}(\mathbf{I} - \mathbf{P}) = \mathbf{0}$, which furnishes $M - 1$ independent linear
equations. Then with help of the additional equation $\mathbf{p}\mathbf{1} = 1$, where $\mathbf{1}$ is a size $M$ column
vector of all ones, we can solve for the $M$ values in $\mathbf{p}$. The existence of a steady state, or
equivalently asymptotic stationarity, will depend on the eigenvalues of the state-transition
matrix $\mathbf{P}$.

[Figure 8.5-1 The state-transition diagram of a two-state Markov chain.]


Example 8.5-5
(asymmetric two-state Markov chain) Here we consider an example of a two-state, asym- 
metric Markov chain (AMC), with state labels 0 and 1, and state-transition matrix, 


$$\mathbf{P} = \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix} = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix}.$$

See Figure 8.5-2.

[Figure 8.5-2 State-transition diagram of general (asymmetric) two-state Markov random sequence,
with state labels 0 and 1.]

Note that in this model there is no requirement that $p_{00} = p_{11}$ and the steady-state
probabilities, if they exist, are given by the solution of

$$\mathbf{p}[n+1] = \mathbf{p}[n]\mathbf{P}, \tag{8.5-5}$$

if we let $n \to \infty$. Denoting these probabilities by $p_0[\infty]$ and $p_1[\infty]$, and using $p_0[\infty] +
p_1[\infty] = 1$, we obtain

$$p_0[\infty] = \frac{1 - p_{11}}{2 - p_{00} - p_{11}}, \qquad p_1[\infty] = \frac{1 - p_{00}}{2 - p_{00} - p_{11}},$$

which, using the $\mathbf{P}$ matrix given above, yields $p_0[\infty] = \frac{2}{3}$ and $p_1[\infty] = \frac{1}{3}$.
The steady-state autocorrelation function of the AMC of this example can be computed
from the Markov state probabilities. For example, assuming asymptotic stationarity,

$$\begin{aligned}
R_{XX}[m] &\simeq P\{X[k] = 1, X[m+k] = 1\} \quad \text{for sufficiently large } k \\
&= P\{X[k] = 1\}P\{X[m+k] = 1|X[k] = 1\} \\
&= p_1[\infty]P\{X[m] = 1|X[0] = 1\},
\end{aligned} \tag{8.5-6}$$

where the last factor is an m-step transition from state 1 to state 1. It can be computed
recursively from Equation 8.5-5, with the initial condition $\mathbf{p}[0] = [0, 1]$. The needed computation
can also be illustrated in a trellis diagram as seen in the following example.
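The steady-state probabilities and the recursion for the autocorrelation in Equation 8.5-6 can be sketched as follows, using the state-transition matrix of Example 8.5-5. The function names are assumptions for this illustration, and the code handles only the two-state case with labels 0 and 1.

```python
def step(p, P):
    """One update p[n+1] = p[n] P for a probability row vector p."""
    size = len(P)
    return [sum(p[i] * P[i][j] for i in range(size)) for j in range(size)]


def steady_state_two_state(P):
    """Closed-form steady state (p0, p1) of a two-state chain."""
    denom = 2.0 - P[0][0] - P[1][1]
    return [(1.0 - P[1][1]) / denom, (1.0 - P[0][0]) / denom]


def autocorrelation(P, max_lag):
    """R_XX[m] = p1[inf] * P{X[m] = 1 | X[0] = 1} for m = 0, ..., max_lag."""
    p1_inf = steady_state_two_state(P)[1]
    p = [0.0, 1.0]                 # condition on X[0] = 1
    values = []
    for _ in range(max_lag + 1):
        values.append(p1_inf * p[1])
        p = step(p, P)             # advance the conditional distribution
    return values


P = [[0.9, 0.1], [0.2, 0.8]]
print(steady_state_two_state(P))
print(autocorrelation(P, 3))
```

The same `step` function also answers multistep questions such as the probability of being in a given state two time units after a known state: apply it twice to a unit vector.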


Example 8.5-6
(trellis diagram for Markov chain) Consider once again Example 8.1-15, where we intro- 
duced the state-transition diagram for what we now know as a Markov chain. Another 
useful diagram that shows allowable paths to reach a certain state, and the probability of 
those paths, is the trellis diagram, named for its resemblance to the common wooden garden 
trellis that supports some plants. See Figure 8.5-3 for the two-state case having labels 0 and 
1, which also assumes symmetry, that is, $p_{ij} = p_{ji}$. We see that this trellis is a collapsing
of the more general tree diagram of Example 8.1-4. The collapse of the tree to the trellis is 
permitted because of the Markov condition on the conditional probabilities, that serve as 
the branch labels. 

Each node represents the state at a given time instant. The node value (label) is its 
probability at time n. The links (directed branches) denote possible transitions and are 
labeled with their respective transition probabilities. Paths through the trellis then repre- 
sent allowable multiple time-step transitions, with probability given as the product of the 
transition probabilities along the path. 

If we know that the chain is in state one at time n = 0, then the modified trellis 
diagram simplifies to that of Figure 8.5-4, where we have labeled the state 1 nodes with 


Figure 8.5-3 A trellis diagram of a two-state symmetric Markov chain with state labels 0 and 1. Here
$p_j[n]$ is the probability of being in state j at time n, shown for n = 0, 1, ..., 5.







Figure 8.5-4 Trellis diagram conditioned on X[0] = 1. 
$P_n \triangleq P\{X[n] = 1 \mid X[0] = 1\}$, and we can use this trellis to calculate the probabilities
$P\{X[n] = 1 \mid X[0] = 1\}$ needed in Equation 8.5-6. The first few $P_n$ values are easily calculated
as $P_1 = p$, $P_2 = p^2 + q^2$, $P_3 = p^3 + 3pq^2$, etc. For the case $p_0[\infty] = p_1[\infty] = \frac{1}{2}$ and $p = 0.8$
(so that $q = 0.2$), the asymptotically stationary autocorrelation (ASA) function $R_{XX}[m]$ then becomes
$R_{XX}[0] = 0.5$, $R_{XX}[\pm 1] = 0.4$, $R_{XX}[\pm 2] = 0.34$, $R_{XX}[\pm 3] = 0.304$, and so forth.†
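The trellis calculation above can be sketched numerically. The following is a minimal illustration, assuming the symmetric two-state chain with stay probability p and switch probability q = 1 - p; the function names `return_probabilities` and `asa` are hypothetical and chosen only for this sketch.

```python
def return_probabilities(p, steps):
    """Return [P_0, ..., P_steps], where P_n = P{X[n] = 1 | X[0] = 1}.

    Assumes a symmetric two-state chain: stay with probability p,
    switch with probability q = 1 - p.
    """
    q = 1.0 - p
    probs = [1.0]  # P_0 = 1 because the chain is known to start in state 1
    for _ in range(steps):
        prev = probs[-1]
        # Reach state 1 by staying in 1 (prob. p) or by switching from 0 (prob. q).
        probs.append(p * prev + q * (1.0 - prev))
    return probs


def asa(p, max_lag, p1_inf=0.5):
    """Asymptotically stationary autocorrelation R_XX[m] for m = 0..max_lag."""
    return [p1_inf * pn for pn in return_probabilities(p, max_lag)]


values = asa(0.8, 3)
print([round(v, 3) for v in values])  # [0.5, 0.4, 0.34, 0.304]
```

Each trellis stage costs one multiply-add here, because the Markov property lets the two node probabilities at time n summarize all paths that reach them.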


The trellis diagram shows that, except in trivial cases, there are many allowable paths 
to reach a certain node, that is, a given state at a given time. This raises the question of 
which path is most probable (most likely) to make the required multistep traversal. In the 
previous example, and with p > q, it is just a matter of finding the path with the most 
p’s. In general, however, finding the most likely path is a time-consuming problem and, if 
left to trial-and-error techniques, would quickly exhaust the capabilities of most computers. 
Much research has been done on this problem because of its many engineering applications,
one being speech recognition by computer. In Chapter 11, we discuss the efficient Viterbi
algorithm for finding the most likely path.


Example 8.5-7
(buffer fullness) Consider the Markov chain as a model for a communications buffer with 
M +1 states, with labels 0 to M indicating buffer fullness. In other words, the state label is 
the number of bytes currently stored in the M byte capacity buffer. Assume that transitions 
can occur only between neighboring states; that is, the fullness can change at most by one 
byte in each time unit. The state-transition diagram then appears as shown in Figure 8.5-5. 





If we let M go to infinity in Example 8.5-7, we have what is called the general birth- 
death chain, which was first used to model the size of a population over time. In each time 
unit, there can be at most one birth and at most one death. 


Solving the equations. Consider a two-state Markov chain with transition probability
matrix

$$\mathbf{P} = \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix}.$$







Figure 8.5-5 Markov chain model for M + 1 state communications buffer. 



†The ASA is computed as $R_{XX}[m] = E\{X[k+m]X[k]\}$, where $k \to \infty$. For levels of 0 and 1, $R_{XX}[m] =
P\{X[m+k] = 1 \mid X[k] = 1\} \times 0.5$. Then clearly $R_{XX}[0] = 1 \times 0.5 = 0.5$, $R_{XX}[1] = 0.8 \times 0.5 = 0.4$,
$R_{XX}[2] = [(0.8)^2 + (0.2)^2] \times 0.5 = 0.34$, etc.
We can write the equation relating p[n] and p[n + 1] then as follows:

$$\big[p_0[n+1],\ p_1[n+1]\big] = \big[p_0[n],\ p_1[n]\big] \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix}. \tag{8.5-7}$$


This vector equation is equivalent to two scalar difference equations which we have to 
solve together, that is, two simultaneous difference equations. We try a solution of the form 


$$p_0[n] = C_0 z^n, \qquad p_1[n] = C_1 z^n.$$


Inserting this attempted solution into Equation 8.5-7 and canceling the common term
$z^n$, we obtain

$$C_0 z = C_0 p_{00} + C_1 p_{10},$$
$$C_1 z = C_0 p_{01} + C_1 p_{11},$$


which implies the following necessary conditions: 


$$C_1 = C_0 \left(\frac{z - p_{00}}{p_{10}}\right) = C_0 \left(\frac{p_{01}}{z - p_{11}}\right).$$

This gives a constraint relation between the constants $C_0$ and $C_1$ as well as a necessary
condition on z, the latter being called the characteristic equation





$$(z - p_{00})(z - p_{11}) - p_{10}p_{01} = 0.$$


It turns out that the characteristic equation (CE) can be written, using the determinant
function, as

$$\det(z\mathbf{I} - \mathbf{P}) = 0.$$


Solving our two-state equation, we obtain just two solutions $z_1$ and $z_2$, one of which
must equal 1. (Can you see this? Note that $1 - p_{00} = p_{01}$.) The solutions we have obtained
thus far can be written as

$$p_0[n] = C_0 z_i^n, \qquad p_1[n] = C_0 \left(\frac{z_i - p_{00}}{p_{10}}\right) z_i^n, \qquad i = 1, 2.$$
Since the vector difference equation is linear, we can add the two solutions corresponding
to the different values of $z_i$ to get the general solution, written in row vector form

$$\mathbf{p}[n] = A_1 \left[1,\ \frac{z_1 - p_{00}}{p_{10}}\right] z_1^n + A_2 \left[1,\ \frac{z_2 - p_{00}}{p_{10}}\right] z_2^n,$$

where we have introduced two new constants $A_1$ and $A_2$ for each of the two linearly
independent solutions. These two constants must be evaluated from the initial probability vector
p[0] and the necessary conditions on the probability row vector at time index n, that is,
$\sum_{i=0}^{1} p_i[n] = 1$ for all $n \ge 0$.


522 Chapter 8 Random Sequences 





Example 8.5-8 
(complete solution) Let 


$$\mathbf{P} = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix} \quad \text{with } \mathbf{p}[0] = [1/2,\ 1/2],$$


and solve for the complete solution, including the startup transient and the steady-state 
values for p[n]. 


The first step is to find the eigenvalues of P, which are the roots of the characteristic 
equation (CE) 


$$\det(z\mathbf{I} - \mathbf{P}) = \det\begin{bmatrix} z - 0.9 & -0.1 \\ -0.2 & z - 0.8 \end{bmatrix} = z^2 - 1.7z + 0.7 = 0.$$

This gives roots $z_1 = 0.7$ and $z_2 = 1.0$. Thus, we can write

$$\mathbf{p}[n] = C_1 [1,\ -1]\, 0.7^n + C_2 [1,\ 0.5]\, 1^n.$$


From the steady-state requirement that the components of p sum to 1.0, we get $C_2 = 2/3$.
So we can further write

$$\mathbf{p}[n] = C_1 [1,\ -1]\, 0.7^n + [2/3,\ 1/3].$$


Finally we invoke the specified initial conditions p[0] = [1/2, 1/2] to obtain $C_1 = -1/6$
and

$$\mathbf{p}[n] = [-1/6,\ 1/6]\, 0.7^n + [2/3,\ 1/3],$$

or in scalar form,

$$p_0[n] = -\tfrac{1}{6}\, 0.7^n + \tfrac{2}{3}, \qquad p_1[n] = \tfrac{1}{6}\, 0.7^n + \tfrac{1}{3}, \qquad \text{for } n \ge 0.$$

Here we see that the steady-state probabilities exist and are $p_0[\infty] = 2/3$ and $p_1[\infty] = 1/3$.
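As a numerical check on Example 8.5-8, the sketch below iterates p[n+1] = p[n]P directly and compares the result with the closed form built from the roots 0.7 and 1.0. It assumes the transition matrix implied by the characteristic equation of the example (rows [0.9, 0.1] and [0.2, 0.8]); the helper names `step`, `closed_form`, and `steady_state` are hypothetical.

```python
def step(p_row, P):
    """One update p[n+1] = p[n] P for a 2-element row vector and a 2x2 matrix."""
    return [p_row[0] * P[0][0] + p_row[1] * P[1][0],
            p_row[0] * P[0][1] + p_row[1] * P[1][1]]


def closed_form(n):
    """Closed-form p[n]: transient (1/6) 0.7^n around the steady state [2/3, 1/3]."""
    transient = 0.7 ** n / 6.0
    return [2.0 / 3.0 - transient, 1.0 / 3.0 + transient]


def steady_state(P):
    """Steady-state pair from the general two-state formulas."""
    denom = 2.0 - P[0][0] - P[1][1]
    return [(1.0 - P[1][1]) / denom, (1.0 - P[0][0]) / denom]


P = [[0.9, 0.1], [0.2, 0.8]]
p = [0.5, 0.5]          # initial probability vector p[0]
max_gap = 0.0
for n in range(30):
    exact = closed_form(n)
    max_gap = max(max_gap, abs(p[0] - exact[0]), abs(p[1] - exact[1]))
    p = step(p, P)

print(max_gap)          # rounding-level gap between iteration and closed form
print(steady_state(P))  # approximately [2/3, 1/3]
```

The transient term decays like 0.7^n, so after a few dozen steps the iterated vector is indistinguishable from the steady state.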
The next example shows that such steady-state probabilities do not always exist. 


Example 8.5-9
(ping pong) Consider the two-state Markov chain with transition probability matrix 


$$\mathbf{P} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$

The characteristic equation becomes

$$\det(z\mathbf{I} - \mathbf{P}) = \det\begin{bmatrix} z & -1 \\ -1 & z \end{bmatrix} = z^2 - 1 = 0,$$

with two roots $z_{1,2} = \pm 1$. Thus there is no steady state in this case, even though one
of the eigenvalues of P is 1. Indeed, directly from the state-transition diagram, we can see
that the random sequence will forever cycle back and forth between states 0 and 1 with
each successive time tick. The phase can be irrevocably set by the initial probability vector
p[0] = [1, 0].
While we cannot always assume a steady state exists, note that this example is degenerate
in that the transition probabilities into and out of states are either 0 or 1. Another
problem for existence of the steady state is a so-called trapping state. This is a state with 
transitions into, but not out of, itself. In most cases of interest in communications and signal 
processing, a steady state will exist, independent of where the chain starts. 


8.6 VECTOR RANDOM SEQUENCES AND STATE EQUATIONS 


The scalar random sequence concepts we have seen thus far can be extended to vector 
random sequences. They are used in Chapter 11 to derive linear estimators for signals in 
noise (Kalman filter). They are also used in models of sensor arrays, for example, seismic, 
acoustic, and radar. This section will introduce difference equations for random vectors and 
the concept of vector Markov random sequence. Interestingly, a high-order Markov-p scalar 
random sequence can be represented as a first-order vector Markov sequence. 


Definition 8.6-1 A vector random sequence is a mapping from a probability sample
space $\Omega$, corresponding to probability space $(\Omega, \mathcal{F}, P)$, into the space of vector-valued
sequences over complex numbers. ■


Thus for each $\zeta \in \Omega$ and fixed time n, we generate a vector $\mathbf{X}[n, \zeta]$. The vector random
sequence is usually written $\mathbf{X}[n]$, suppressing the outcome $\zeta$.
For example, the first-order CDF for a random vector sequence $\mathbf{X}[n]$ would be given as

$$F_{\mathbf{X}}(\mathbf{x}; n) \triangleq P\{\mathbf{X}[n] \le \mathbf{x}\},$$

where $\{\mathbf{X}[n] \le \mathbf{x}\}$ means every element satisfies the inequality, that is, $\{X_1[n] \le x_1, X_2[n] \le
x_2, \ldots, X_N[n] \le x_N\}$. Second- and higher-order probabilities would be specified accordingly.
The vector random sequence is said to be statistically specified by the set of all its first- and 
higher-order CDFs (or pdf’s or PMFs). 
The following example treats the correlation analysis for a vector random sequence 
input to a vector LCCDE 
$$\mathbf{y}[n] = \mathbf{A}\mathbf{y}[n-1] + \mathbf{B}\mathbf{x}[n],$$


with N-dimensional coefficient matrices A and B. In this vector case, BIBO stability is 
assured when the eigenvalues of A are less than one in magnitude. 


Example 8.6-1 
(vector LCCDEs) In the vector case, the scalar first-order LCCDE model, excited by column 
vector random sequence X[n], becomes 





Y[n] = AY[n — 1] + BX[n], (8.6-1) 


which is a first-order vector difference equation in the sample vector sequences. The vector 
impulse response is the column vector sequence 
$$\mathbf{h}[n] = \mathbf{A}^n \mathbf{B}\, u[n],$$

and the zero initial-condition response to the sequence X[n] is

$$\mathbf{Y}[n] = \sum_{k=0}^{n} \mathbf{A}^{n-k}\mathbf{B}\mathbf{X}[k] = \mathbf{h}[n] * \mathbf{X}[n].$$

The matrix system function is

$$\mathbf{H}(z) = (\mathbf{I} - \mathbf{A}z^{-1})^{-1}\mathbf{B},$$


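as can easily be verified. The WSS cross-correlation matrices $\mathbf{R}_{YX}[m] \triangleq E\{\mathbf{Y}[n+m]\mathbf{X}^{\dagger}[n]\}$
(where the “†” indicates the Hermitian (or conjugate) transpose) and $\mathbf{R}_{XY}[m] \triangleq
E\{\mathbf{X}[n+m]\mathbf{Y}^{\dagger}[n]\}$, between an input WSS random vector sequence and its WSS output
random sequence, become

$$\mathbf{R}_{YX}[m] = \mathbf{h}[m] * \mathbf{R}_{XX}[m],$$
$$\mathbf{R}_{XY}[m] = \mathbf{R}_{XX}[m] * \mathbf{h}^{\dagger}[-m].$$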


Parenthetically, we note that for a causal h, such as would arise from recursive solution 
of the above vector LCCDE, we have the output Y [n] uncorrelated with the future values 
of the input X[n], when the input X = W is assumed a white noise vector sequence. 

The output correlation matrix is 


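$$\mathbf{R}_{YY}[m] = \mathbf{h}[m] * \mathbf{R}_{XX}[m] * \mathbf{h}^{\dagger}[-m]$$

and the output psd matrix becomes upon Fourier transformation

$$\mathbf{S}_{YY}(\omega) = \mathbf{H}(\omega)\mathbf{S}_{XX}(\omega)\mathbf{H}^{\dagger}(\omega).$$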
The total solution of Equation 8.6-1 for any $n > n_0$ can be written as

$$\mathbf{Y}[n] = \mathbf{A}^{n-n_0}\, \mathbf{Y}[n_0] + \sum_{k=n_0+1}^{n} \mathbf{h}[n-k]\,\mathbf{X}[k], \qquad n > n_0,$$


in terms of the initial condition $\mathbf{Y}[n_0]$ that must be specified at $n_0$. In the limit as $n_0 \to -\infty$,
and for a stable system matrix A, this then becomes the convolution summation

$$\mathbf{Y}[n] = \mathbf{h}[n] * \mathbf{X}[n], \qquad -\infty < n < +\infty.$$
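The equivalence between the recursion of Equation 8.6-1 and the convolution with the impulse response $\mathbf{A}^n\mathbf{B}u[n]$ can be checked on a small deterministic input. The sketch below is illustrative only: the 2 × 2 matrices `A` and `B`, the input list `xs`, and all function names are assumptions made for this example, and plain nested lists stand in for matrices.

```python
def mat_vec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def mat_mul(M, N):
    """Matrix-matrix product for nested-list matrices."""
    cols = list(zip(*N))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in M]


def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]


def recursive_response(A, B, xs):
    """Run Y[n] = A Y[n-1] + B X[n] from a zero initial condition."""
    y = [0.0] * len(A)
    out = []
    for x in xs:
        y = vec_add(mat_vec(A, y), mat_vec(B, x))
        out.append(y)
    return out


def convolution_response(A, B, xs):
    """Evaluate Y[n] = sum_{k=0}^{n} A^(n-k) B X[k] directly."""
    dim = len(A)
    power = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    h = []  # h[k] = A^k B for k >= 0
    for _ in xs:
        h.append(mat_mul(power, B))
        power = mat_mul(power, A)
    out = []
    for n in range(len(xs)):
        y = [0.0] * dim
        for k in range(n + 1):
            y = vec_add(y, mat_vec(h[n - k], xs[k]))
        out.append(y)
    return out


A = [[0.5, 0.1], [0.0, 0.3]]   # triangular, eigenvalues 0.5 and 0.3: stable
B = [[1.0, 0.0], [0.5, 1.0]]
xs = [[1.0, -1.0], [0.5, 2.0], [0.0, 1.0], [-1.0, 0.25], [2.0, 0.0]]

rec = recursive_response(A, B, xs)
conv = convolution_response(A, B, xs)
max_gap = max(abs(a - b) for r, c in zip(rec, conv) for a, b in zip(r, c))
print(max_gap)  # rounding-level difference only
```

The recursion needs one matrix-vector product per output sample, while the direct convolution grows with n, which is one reason the state-equation form is preferred in practice.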





Definition 8.6-2 A vector random sequence Y[n] is vector Markov if for all K > 0
and for all $n_K > n_{K-1} > \cdots > n_1$, we have

$$P\{\mathbf{Y}[n_K] \le \mathbf{y}_K \mid \mathbf{y}[n_{K-1}], \ldots, \mathbf{y}[n_1]\} = P\{\mathbf{Y}[n_K] \le \mathbf{y}_K \mid \mathbf{y}[n_{K-1}]\}$$

for all real values of the vector $\mathbf{y}_K$, and all conditioning vectors $\mathbf{y}[n_{K-1}], \ldots, \mathbf{y}[n_1]$. (cf.
Definition 8.5-2 of Markov-p property.) ■


We can now state the following theorem for vector random sequences: 
Theorem 8.6-1 In the state equation 
X[n] = AX[n — 1] + BW[n], for n > 0, with X[0] = 0, 


let the input W[n] be a white Gaussian random sequence. Then the output X[n] for n > 0 
is a vector Markov random sequence. 


The proof is left to the reader as an exercise. ■


Example 8.6-2 
(relation between scalar Markov-p and vector Markov) Let X[n] be a Markov-p random 
sequence satisfying the pth order difference equation 





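$$X[n] = a_1 X[n-1] + \cdots + a_p X[n-p] + bW[n].$$

Defining the p-dimensional vector random sequence $\mathbf{X}[n] = [X[n], \ldots, X[n-p+1]]^T$
and coefficient matrix

$$\mathbf{A} \triangleq \begin{bmatrix}
a_1 & a_2 & \cdots & a_{p-1} & a_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix},$$

we have

$$\mathbf{X}[n] = \mathbf{A}\mathbf{X}[n-1] + \mathbf{b}W[n].$$

Thus $\mathbf{X}[n]$ is a vector Markov random sequence with $\mathbf{b} = [b, 0, \ldots, 0]^T$. Such a vector
transformation of a scalar equation is called a state-variable representation [8-7].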
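The state-variable construction of Example 8.6-2 can be illustrated with a short sketch that builds the companion matrix and confirms that the first state component reproduces the scalar pth-order recursion. Zero initial conditions and a deterministic stand-in for W[n] are assumed here, and the function names `companion`, `scalar_recursion`, and `state_recursion` are hypothetical.

```python
def companion(a):
    """Companion matrix: first row a_1..a_p, then a shifted identity below it."""
    p = len(a)
    rows = [list(a)]
    for i in range(p - 1):
        rows.append([1.0 if j == i else 0.0 for j in range(p)])
    return rows


def scalar_recursion(a, b, w):
    """X[n] = a_1 X[n-1] + ... + a_p X[n-p] + b W[n], with X[n] = 0 for n < 0."""
    p = len(a)
    x = []
    for n, w_n in enumerate(w):
        past = sum(a[i] * x[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
        x.append(past + b * w_n)
    return x


def state_recursion(a, b, w):
    """Vector form X[n] = A X[n-1] + b W[n]; returns the first state component."""
    p = len(a)
    A = companion(a)
    state = [0.0] * p          # holds [X[n-1], ..., X[n-p]] before the update
    first = []
    for w_n in w:
        state = [sum(A[i][j] * state[j] for j in range(p)) for i in range(p)]
        state[0] += b * w_n    # the vector b = [b, 0, ..., 0]^T drives only row 0
        first.append(state[0])
    return first


a = [0.5, -0.2, 0.1]           # illustrative coefficients for p = 3
b = 2.0
w = [1.0, 0.0, -1.0, 0.5, 0.25, -0.75, 1.5, 0.0]

x_scalar = scalar_recursion(a, b, w)
x_state = state_recursion(a, b, w)
print(max(abs(u - v) for u, v in zip(x_scalar, x_state)))  # rounding-level gap
```

The price of the first-order form is a state of dimension p; the benefit is that every result stated for first-order vector Markov sequences then applies to the scalar Markov-p sequence.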





8.7 CONVERGENCE OF RANDOM SEQUENCES 


Some nonstationary random sequences may converge to a limit as the sequence index goes 
to infinity, for example as time becomes infinite. This asymptotic behavior is evidenced 
in probability theory by convergence of the fraction of successes in an infinite Bernoulli 
sequence, where the relevant theorems are called the laws of large numbers. Also, when 
we study the convergence of random processes in Chapter 10 we will sometimes make a 
sequence of finer and finer approximations to the output of a random system at a given 
time, say $t_0$, that is, $Y_n(t_0)$. The index n then defines a random sequence, which should
converge in some sense to the true output. In this section we will look at several types of 
convergence for random sequences, that is, sequences of random variables. 
We start by reviewing the concept of convergence for deterministic sequences. Let $x_n$
be a sequence of complex (or real) numbers; then convergence is defined as follows.


Definition 8.7-1 A sequence of complex (or real) numbers $x_n$ converges to the
complex (or real) number x if given any $\varepsilon > 0$, there exists an integer $n_0$ such that whenever
$n > n_0$, we have

$$|x_n - x| < \varepsilon.$$ ■


Note that in this definition the value $n_0$ may depend on the value $\varepsilon$; that is, when $\varepsilon$
is made smaller, most likely $n_0$ will need to be made larger. Sometimes this dependence is
formalized by writing $n_0(\varepsilon)$ in place of $n_0$ in this definition. Convergence is often written as

$$\lim_{n \to \infty} x_n = x \quad \text{or as} \quad x_n \to x \ \text{ as } n \to \infty.$$


A practical problem with this definition is that one must have the limit x to test 
for convergence. For simple cases one can often guess what the limit is and then use the 
definition to verify that this limit indeed exists. Fortunately, for more complex situations 
there is an alternative in the Cauchy criterion for convergence, which we state as a theorem 
without proof. 


Theorem 8.7-1 (Cauchy criterion [8-8]) A sequence of complex (or real) numbers
$x_n$ converges to a limit if and only if (iff)

$$|x_n - x_m| \to 0 \quad \text{as both } n \text{ and } m \to \infty.$$


The reason that this works for complex (or real) numbers is that the set of all complex (or 
real) numbers is complete, meaning that it contains all its limit points. For example, the
set $\{0 < x < 1\} = (0, 1)$ is not complete, but the set $\{0 \le x \le 1\} = [0, 1]$ is complete
because sequences $x_n$ in these sets and tending to 0 or 1 have a limit point in the set [0, 1]
but have no limit point in the set (0, 1). In fact, the set of all complex (or real) numbers
is complete, as are n-dimensional linear vector spaces over both the real and complex
number fields. Thus the Cauchy criterion for convergence applies in these cases also. For
more on numerical convergence see [8-8]. 

Convergence for functions is defined using the concept of convergence of sequences of
numbers. We say the sequence of functions $f_n(x)$ converges to the function $f(x)$ if the
corresponding sequence of numbers converges for each x. It is stated more formally in the
following definition.


Definition 8.7-2 The sequence of functions $f_n(x)$ converges (pointwise) to the
function $f(x)$ if for each $x_0$ the sequence of complex numbers $f_n(x_0)$ converges to $f(x_0)$. ■


The Cauchy criterion for convergence applies to pointwise convergence of functions
if the set of functions under consideration is complete. The set of continuous functions
is not complete because a sequence of continuous functions may converge to a
discontinuous function (cf. item (d) in Example 8.7-1). However, the set of bounded functions is
complete [8-8].

The following are some examples of convergent sequences of numbers and functions. 
We leave the demonstration of these results as exercises for the reader. 
Example 8.7-1
(some convergent sequences)


(a) $x_n = (1 - 1/n)a + (1/n)b \to a$ as $n \to \infty$.

(b) $x_n = \sin(\omega + e^{-n}) \to \sin\omega$ as $n \to \infty$.

(c) $f_n(x) = \sin[(\omega + 1/n)x] \to \sin(\omega x)$, as $n \to \infty$ for any (fixed) x.

(d) $f_n(x) = \begin{cases} e^{-n^2 x}, & \text{for } x > 0 \\ 1, & \text{for } x \le 0 \end{cases} \ \to u(-x)$, as $n \to \infty$ for any (fixed) x.


The reader should note that in the convergence of the functions in (c) and (d), the variable x
is held constant as the limit is being taken. The limit is then repeated for each such x value
to find the limiting function.
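Pointwise convergence in item (d) can be observed numerically by fixing x and letting n grow. The sketch below assumes that item (d) reads $f_n(x) = e^{-n^2 x}$ for x > 0 and $f_n(x) = 1$ for x ≤ 0, with limit u(-x) taken as 1 for x ≤ 0 and 0 otherwise; the names `f` and `u_neg` are hypothetical.

```python
import math


def f(n, x):
    """Assumed form of item (d): exp(-n^2 x) for x > 0, and 1 for x <= 0."""
    return math.exp(-n * n * x) if x > 0 else 1.0


def u_neg(x):
    """Limit function u(-x): 1 for x <= 0, 0 for x > 0."""
    return 1.0 if x <= 0 else 0.0


# Hold each x fixed and watch the gap to the limit shrink as n grows.
for x in (-1.0, 0.0, 0.01, 1.0):
    gaps = [abs(f(n, x) - u_neg(x)) for n in (1, 10, 100, 1000)]
    print(x, gaps)
```

Each f_n is continuous at x = 0, yet the limit is a step, which is the behavior the text cites when it notes that the set of continuous functions is not complete.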





Since a random variable is a function, a sequence of random variables (also called a 
random sequence) is a sequence of functions. Thus, we can define the first and strongest 
type of convergence for random variables. 


Definition 8.7-3 (sure convergence) The random sequence X[n] converges surely to
the random variable X if the sequence of functions $X[n, \zeta]$ converges to the function $X(\zeta)$
as $n \to \infty$ for all outcomes $\zeta \in \Omega$. ■


As a reminder, the functions $X(\zeta)$ are not arbitrary. They are random variables and
thus satisfy the condition that the set $\{\zeta: X(\zeta) \le x\} \in \mathcal{F}$ for all x, that is, that this set
be an event for all values of x. This is in fact necessary for the calculation of probability 
since the probability measure P is defined only for events. Such functions X are more 
generally called measurable functions and in a course on real analysis it is shown that the 
space of measurable functions is complete [8-1]. If we have a Cauchy sequence of measurable 
functions (random variables), then one can show that the limit function exists and is also 
measurable (a random variable). Thus, the Cauchy convergence criterion also applies for 
random variables. 

Most of the time we are not interested in precisely defining random variables for sets
in $\Omega$ of probability zero because it is thought that these events will never occur. In this
case, we can weaken the concept of sure convergence to the still very strong concept of 
almost-sure convergence. 


Definition 8.7-4 (almost-sure convergence) The random sequence X[n] converges
almost surely to the random variable X if the sequence of functions $X[n, \zeta]$ converges for
all outcomes $\zeta \in \Omega$ except possibly on a set of probability zero. ■


This is the strongest type of convergence normally used in probability theory. It is also
called probability-1 convergence. It is sometimes written

$$P\left\{\lim_{n \to \infty} X[n, \zeta] = X(\zeta)\right\} = 1,$$

meaning simply that there is a set A such that P[A] = 1 and X[n] converges to X for all
$\zeta \in A$. In particular $A \triangleq \{\zeta: \lim_{n \to \infty} X[n, \zeta] = X(\zeta)\}$. Here the set $A^c$ is the
probability-zero set mentioned in this definition. As shorthand notation we also use

$$X[n] \to X \ \text{ a.s.} \quad \text{and} \quad X[n] \to X \ \text{ pr.1},$$


where the abbreviation “a.s.” stands for almost surely, and “pr.1” stands for probability 1. 

An example of probability-1 convergence is the Strong Law of Large Numbers to be 
proved in the next section. Three examples of random sequences are next evaluated for 
possible convergence. 


Example 8.7-2 
(convergence of random sequences) For each of the following three random sequences, we
assume that the probability space $(\Omega, \mathcal{F}, P)$ has sample space $\Omega = [0, 1]$. $\mathcal{F}$ is the family
of Borel subsets of $\Omega$ and the probability measure P is Lebesgue measure, which on a real
interval (a, b] is just its length l, that is,

$$l(a, b] \triangleq b - a \quad \text{for } b \ge a.$$

(a) $X[n, \zeta] = n\zeta$.

(b) $X[n, \zeta] = \sin(n\zeta)$.

(c) $X[n, \zeta] = \exp[-n^2(\zeta - \frac{1}{n})]$.

The sequence in (a) clearly diverges to $+\infty$ for any $\zeta \ne 0$. Thus this random sequence
does not converge. The sequence in (b) does not diverge, but it oscillates between -1 and
+1 except for the one point $\zeta = 0$. Thus this random sequence does not converge either.

Considering the random sequence in (c), the graph in Figure 8.7-1 shows that this
sequence converges as follows:

$$\lim_{n \to \infty} X[n, \zeta] = \begin{cases} \infty & \text{for } \zeta = 0 \\ 0 & \text{for } \zeta > 0. \end{cases}$$


Figure 8.7-1 Plot of sequence (c) $X[n, \zeta]$ versus $\zeta$ for $\Omega = [0, 1]$ for n = 1, ..., 4.





Thus, we can say that the random sequence converges to the (degenerate) random
variable X = 0 with probability 1. We simply take A = (0, 1] and note that P[A] = 1 and
that $X[n, \zeta] \approx 0$ for every $\zeta$ in A for sufficiently large n. We write $X[n] \to 0$ a.s. However,
X[n] clearly does not converge surely to zero.


Thus far we have been discussing pointwise convergence of sequences of functions and
random sequences. This is similar to considering a space of bounded functions $\mathcal{B}$ with the
norm

$$\|f\|_\infty = \sup_x |f(x)|.^\dagger$$

When we write $f_n \to f$ in the function space $\mathcal{B}$, we mean that $\|f_n - f\|_\infty = \sup_x |f_n(x) -
f(x)| \to 0$, giving us pointwise convergence. The space of continuous bounded functions is
denoted $L^\infty$ and is known to be complete ([8-1], p. 115).

Another type of function space of great practical interest uses the energy norm (cf.
Equation 4.4-6):

$$\|f\|_2 \triangleq \left(\int_{-\infty}^{+\infty} |f(x)|^2\, dx\right)^{1/2}.$$

The space of integrable (measurable) functions with finite energy norm is denoted $L^2$. When
we say a sequence of functions converges in $L^2$, that is, $\|f_n - f\|_2 \to 0$, we mean that

$$\left(\int_{-\infty}^{+\infty} |f_n(x) - f(x)|^2\, dx\right)^{1/2} \to 0 \quad \text{as } n \to \infty.$$


This space of integrable functions is also complete [8-1]. A corresponding concept for random 
sequences is given by mean-square convergence. 


Definition 8.7-5 (mean-square convergence) A random sequence X[n] converges in
the mean-square sense to the random variable X if $E\{|X[n] - X|^2\} \to 0$ as $n \to \infty$. ■


This type of convergence depends only on the second-order properties of the random 
variables and is thus often easier to calculate than a.s. convergence. A second benefit of the 
mean-square type of convergence is that it is closely related to the physical concept of power. 
If X[n] converges to X in the mean-square sense, then we can expect that the variance of
the error $\varepsilon[n] \triangleq X[n] - X$ will be small for large n. If we look back at Example 8.7-2(c),
we can see that this random sequence does not converge in the mean-square sense, so that
the error variance or power as defined here would not ever be expected to be small. To see
this, consider possible mean-square convergence to zero (since $X[n] \to 0$ a.s.),


†The supremum or sup operator is similar to the max operator. The supremum of a set of numbers is
the smallest number greater than or equal to each number in the set, for example, $\sup\{0 < x < 1\} = 1$.
Note the difficulty with max in this example since 1 is not included in the open interval (0, 1); thus the max
does not exist here!
$$\begin{aligned}
E\{|X[n] - 0|^2\} &= E\{X[n]^2\} \\
&= \int_0^1 \exp(-2n^2\zeta)\exp(2n)\, d\zeta \\
&= \exp(2n)\int_0^1 \exp(-2n^2\zeta)\, d\zeta \\
&= \exp(2n)\left[\frac{1 - \exp(-2n^2)}{2n^2}\right] \to \infty \quad \text{as } n \to \infty.
\end{aligned}$$

Hence X[n] does not converge in the mean-square sense to 0.
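The contrast between almost-sure and mean-square behavior for sequence (c) of Example 8.7-2 can be made concrete. The sketch below, which is only an illustration, evaluates $X[n, \zeta] = \exp[-n^2(\zeta - 1/n)]$ at a fixed outcome and compares the closed-form second moment derived above with a midpoint-rule integral over $\Omega = [0, 1]$; the function names and the number of integration cells are assumptions.

```python
import math


def x(n, zeta):
    """Sequence (c): X[n, zeta] = exp(-n^2 (zeta - 1/n))."""
    return math.exp(-n * n * (zeta - 1.0 / n))


def second_moment_closed_form(n):
    """E{X[n]^2} = exp(2n) (1 - exp(-2 n^2)) / (2 n^2)."""
    return math.exp(2 * n) * (1.0 - math.exp(-2 * n * n)) / (2 * n * n)


def second_moment_numeric(n, cells=20000):
    """Midpoint-rule estimate of the integral of X[n, zeta]^2 over [0, 1]."""
    width = 1.0 / cells
    return sum(x(n, (i + 0.5) * width) ** 2 for i in range(cells)) * width


# At a fixed outcome zeta > 0 the sample sequence eventually decays to zero ...
print([x(n, 0.5) for n in (1, 2, 4, 8, 16)])

# ... while the second moment grows without bound.
for n in range(1, 6):
    print(n, second_moment_closed_form(n), second_moment_numeric(n))
```

The growth comes entirely from a shrinking neighborhood of $\zeta = 0$, where the sequence takes very large values on a set of small but nonzero probability.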

Still another type of convergence that we will consider is called convergence in proba- 
bility. It is weaker than probability-1 convergence and also weaker than mean-square conver- 
gence. This is the type of convergence displayed in the Weak Laws of Large Numbers to be 
discussed in the next section. It is defined as follows: 


Definition 8.7-6 (convergence in probability) Given the random sequence X[n] and
the limiting random variable X, we say that X[n] converges in probability to X if for every
$\varepsilon > 0$,

$$\lim_{n \to \infty} P[|X[n] - X| > \varepsilon] = 0.$$ ■


We sometimes write $X[n] \to X$ (p), where (p) denotes the type of convergence. Also
convergence in probability is sometimes called p-convergence.


One can use Chebyshev’s inequality (Theorem 4.4-1), $P[|Y| > \varepsilon] \le E[|Y|^2]/\varepsilon^2$ for $\varepsilon > 0$,
to show that mean-square convergence implies convergence in probability. For example, let
$Y \triangleq X[n] - X$; then the preceding inequality becomes

$$P[|X[n] - X| > \varepsilon] \le E[|X[n] - X|^2]/\varepsilon^2.$$

Now mean-square convergence implies that the right-hand side goes to zero as $n \to \infty$, for
any fixed $\varepsilon > 0$, which implies that the left-hand side must also go to zero, which is the
definition of convergence in probability. Thus we have proved the following result.


Theorem 8.7-2 Convergence of a random sequence in the mean-square sense implies
convergence in probability. ■


The relation between convergence with probability 1 and convergence in probability is 
more subtle. The main difference between them can be seen by noting that the former talks 
about the probability of the limit while the latter talks about the limit of the probability. 
Further insight can be gained by noting that a.s. convergence is concerned with convergence 
of the entire sample sequences while p-convergence is concerned only with the convergence 
of the random variable at an individual n. That is to say, a.s. convergence is concerned with 
the joint events at an infinite number of times, while p-convergence is concerned with the 
simple event at time n, albeit large. One can prove the following theorem. 


Theorem 8.7-3 Convergence with probability 1 implies convergence in probability.

Proof (adapted from Gnedenko [8-9].) Let $X[n] \to X$ a.s. and define the set A,

$$A \triangleq \bigcap_{k=1}^{\infty} \bigcup_{n=1}^{\infty} \bigcap_{m=1}^{\infty} \{\zeta: |X[n+m, \zeta] - X(\zeta)| < 1/k\}.$$


Then it must be that P[A] = 1. To see this we note that A is the set of $\zeta$ such that starting
at some n and for all later n we have $|X[n, \zeta] - X(\zeta)| < 1/k$ and furthermore this must hold
for all k > 0. Thus, A is precisely the set of $\zeta$ on which $X[n, \zeta]$ is convergent. So P[A] must
be 1. Eventually for n large enough and 1/k small enough we get $|X[n, \zeta] - X(\zeta)| < \varepsilon$, and
the error stays this small for all larger n. Thus,


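$$P\left[\bigcup_{n=1}^{\infty} \bigcap_{m=1}^{\infty} \{|X[n+m] - X| < \varepsilon\}\right] = 1 \quad \text{for all } \varepsilon > 0,$$

which implies by the continuity of probability,

$$\lim_{n \to \infty} P\left[\bigcap_{m=1}^{\infty} \{|X[n+m] - X| < \varepsilon\}\right] = 1 \quad \text{for all } \varepsilon > 0,$$

which in turn implies the greatly weakened result

$$\lim_{n \to \infty} P[|X[n+m] - X| < \varepsilon] = 1 \quad \text{for all } \varepsilon > 0, \tag{8.7-1}$$

which is equivalent to the definition of p-convergence. ■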

Because of the gross weakening of the a.s. condition, that is, the enlargement of the set A
in the foregoing proof, it can be seen that p-convergence does not imply a.s. convergence.
We note in particular that Equation 8.7-1 may well be true even though no single sample
sequence stays close to X for all n + m > n. This is in fact the key difference between these
two types of convergence.


Example 8.7-3 
(a convergent random sequence?) Define a random pulse sequence X[n] on $n \ge 0$ as follows:
Set X[0] = 1. Then for the next two points set exactly one of the X[n]’s to 1, equally
likely among the two points, and the other to 0. For the next three points set exactly one
of the X[n]’s to 1, equally likely among the three points, and set the others to 0. Continue
this procedure for the next four points, setting exactly one of the X[n]’s to 1, equally likely
among the four points, and the others to 0, and so forth. A sample function would look like
Figure 8.7-2.

Obviously this random sequence is slowly converging to zero in some sense as $n \to \infty$.
In fact a simple calculation would show p-convergence and also mean-square convergence
due to the growing distance between pulses as $n \to \infty$. In fact at $n \approx \frac{1}{2}l^2$, the probability
of a one (pulse) is only 1/l. However, we do not have a.s. convergence, since every sample
sequence has ones appearing arbitrarily far out on the n axis. Thus no sample sequences
converge to zero.
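A short simulation makes the two behaviors of Example 8.7-3 visible side by side. The sketch below is a minimal illustration, assuming blocks of lengths 1, 2, 3, ... with exactly one pulse placed uniformly at random in each block; the names `sample_path`, `block_of`, and `pulse_probability`, and the seed, are hypothetical.

```python
import random


def sample_path(num_blocks, rng):
    """One sample sequence: block l has l points with a single 1 placed uniformly."""
    path = []
    for length in range(1, num_blocks + 1):
        block = [0] * length
        block[rng.randrange(length)] = 1
        path.extend(block)
    return path


def block_of(n):
    """Length l of the block containing time index n (block l starts at l(l-1)/2)."""
    length = 1
    while length * (length + 1) // 2 <= n:
        length += 1
    return length


def pulse_probability(n):
    """P{X[n] = 1} = E{X[n]^2} = 1/l for n in the block of length l."""
    return 1.0 / block_of(n)


rng = random.Random(1)
path = sample_path(50, rng)

# The probability of a pulse at time n shrinks toward zero ...
print([pulse_probability(n) for n in (0, 10, 100, 1000)])

# ... yet every block of every sample path still contains exactly one pulse.
start = 0
pulses_per_block = []
for length in range(1, 51):
    pulses_per_block.append(sum(path[start:start + length]))
    start += length
print(set(pulses_per_block))
```

Because $E\{|X[n] - 0|^2\} = P\{X[n] = 1\}$ for a 0/1 sequence, the same 1/l decay gives both mean-square convergence and p-convergence, while the one-pulse-per-block property rules out convergence of any individual sample sequence.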
















Figure 8.7-2 A sequence that is converging in probability but not with probability 1. 


Figure 8.7-3 Venn diagram illustrating relationships of various possible convergence modes for random
sequences. Legend: s = surely, a.s. = almost surely, m.s. = mean square, p = probability, d = distribution.


One final type of convergence that we consider is not a convergence for random variables 
at all! Rather it is a type of convergence for distribution functions. 


Definition 8.7-7 A random sequence X[n] with CDF $F_n(x)$ converges in distribution
to the random variable X with CDF $F(x)$ if

$$\lim_{n \to \infty} F_n(x) = F(x)$$

at all x for which F is continuous. ■


Note that in this definition we are not really saying anything about the random variables 
themselves, just their CDFs. Convergence in distribution just means that as n gets large the 
CDFs are converging or becoming alike. For example, the sequence X[n] and the variable 
X can be jointly independent even though X[n] converges to X in distribution. This is 
radically different from the four earlier types of convergence, where as n gets large the 
random variables X[n] and X are becoming very dependent because some type of “error”
between them is going to zero. Convergence in distribution is the type of convergence that 
occurs in the Central Limit Theorem (see Section 4.7). The relationships between these five 
types of convergence are shown diagrammatically in Figure 8.7-3, where we have used the 
fact that p-convergence implies convergence in distribution, which is shown below. Note that 
even sure convergence may not imply mean-square convergence. This is because the integral
of the square of the limiting random variable, with respect to the probability measure, may 
diverge. 
To see that p-convergence implies convergence in distribution, assume that the limiting
random variable X is continuous so that it has a pdf. First we consider the conditional
distribution function

$$F_{X[n]|X}(y \mid x) \triangleq P\{X[n] \le y \mid X = x\}.$$

From the definition of p-convergence, it should be clear that

$$F_{X[n]|X}(y \mid x) \to \begin{cases} 1, & y > x \\ 0, & y < x \end{cases} \quad \text{as } n \to \infty,$$

so that

$$F_{X[n]|X}(y \mid x) \to u(y - x), \quad \text{except possibly at the one point } y = x,$$

and hence

$$\begin{aligned}
F_{X[n]}(y) = P\{X[n] \le y\} &= \int_{-\infty}^{+\infty} F_{X[n]|X}(y \mid x)\, f_X(x)\, dx \\
&\to \int_{-\infty}^{+\infty} u(y - x)\, f_X(x)\, dx \\
&= \int_{-\infty}^{y} f_X(x)\, dx \\
&= F_X(y),
\end{aligned}$$

as was to be shown. In the case where the limiting random variable X is not continuous,
we must exercise more care but the result is still true at all points x for which $F_X(x)$ is
continuous. (See Problem 8.54.)


8.8 LAWS OF LARGE NUMBERS 


The Laws of Large Numbers have to do with the convergence of a sequence of estimates 
of the mean of a random variable. As such they concern the convergence of a random 
sequence to a constant. The Weak Laws obtain convergence in probability, while the Strong 
Laws yield convergence with probability 1. A version of the Weak Law has already been 
demonstrated in Example 4.4-3. We restate it here for convenience. 


Theorem 8.8-1 (Weak Law of Large Numbers) Let X[n] be an independent random
sequence with mean $\mu_X$ and variance $\sigma_X^2$ defined for $n \ge 1$. Define another random
sequence as

$$\hat{\mu}_X[n] \triangleq (1/n) \sum_{k=1}^{n} X[k] \quad \text{for } n \ge 1.$$

Then $\hat{\mu}_X[n] \to \mu_X$ (p) as $n \to \infty$. ■
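The Weak Law can be illustrated by estimating how often the sample mean misses the true mean by more than a fixed tolerance. The sketch below is an illustration under stated assumptions, not part of the theorem: it uses independent uniform(0, 1) samples (mean 1/2, variance 1/12), a tolerance of 0.05, a fixed seed, and hypothetical function names; the Chebyshev bound $\sigma_X^2/(n\varepsilon^2)$ is printed for comparison.

```python
import random


def sample_mean(n, rng):
    """Sample mean of n independent uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(n)) / n


def miss_fraction(n, eps, trials, rng):
    """Fraction of trials in which |sample mean - 1/2| exceeds eps."""
    misses = sum(1 for _ in range(trials)
                 if abs(sample_mean(n, rng) - 0.5) > eps)
    return misses / trials


def chebyshev_bound(n, eps, variance=1.0 / 12.0):
    """Chebyshev upper bound variance / (n eps^2), capped at 1."""
    return min(1.0, variance / (n * eps * eps))


rng = random.Random(2024)
eps = 0.05
for n in (10, 100, 1000):
    print(n, miss_fraction(n, eps, 400, rng), chebyshev_bound(n, eps))
```

The empirical miss fraction is an estimate from a finite number of trials, so it fluctuates from run to run; the bound, in contrast, holds for every n and is usually quite loose.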










Remember, an independent random sequence is one whose terms are all jointly inde- 
pendent. Another version of the Weak Law allows the random sequence to be of nonuniform 
variance. 


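Theorem 8.8-2 (Weak Law, nonuniform variance) Let X[n] be an independent
random sequence with constant mean $\mu_X$ and variance $\sigma_X^2[n]$ defined for $n \ge 1$. Then if

$$\sum_{n=1}^{\infty} \sigma_X^2[n]/n^2 < \infty,$$

$$\hat{\mu}_X[n] \to \mu_X \ \text{(p)} \quad \text{as } n \to \infty.$$ ■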


Both of these theorems are also true for convergence with probability 1, in which case 
they become Strong Laws. The theorems concerning convergence with probability 1 are 
best derived using the concept of a Martingale sequence. By introducing this concept we 
can also get another useful result called the Martingale convergence theorem, which is 
helpful in estimation and decision/detection theory. 


Definition 8.8-1† A random sequence X[n] defined for $n \ge 0$ is called a Martingale
if the conditional expectation

$$E\{X[n] \mid X[n-1], X[n-2], \ldots, X[0]\} = X[n-1] \quad \text{for all } n \ge 1.$$ ■


Viewing the conditional expectation as an estimate of the present value of the sequence 
based on the past, then for a Martingale this estimate is just the most recent past value. If 
we interpret X[n] as an amount of capital in a betting game, then the Martingale condition
can be regarded as necessary for fairness of the game, which in fact is how it was first 
introduced [8-1]. 


Example 8.8-1 
(binomial counting sequence) Let W[n] be a Bernoulli random sequence taking values $\pm 1$
with equal probability and defined for $n \ge 0$. Let X[n] be the corresponding Binomial
counting sequence

$$X[n] \triangleq \sum_{k=0}^{n} W[k], \qquad n \ge 0.$$


Then X[n] is a Martingale, which can be shown as follows: 


E{X[n]|X[n —1],..., X[0]} = E [5 WI[k]|X [n — mo 


k=0 


= J E{W[k]|X(n — 1], ..., X[O]} 
k=0 


tThe material dealing with Martingale sequences can be omitted on a first reading. 
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n 


= > E{WIk]IW n —3),..., WO)} 


k=0 

= ȘT Wk] + E{Wn]} 
k=0 

= X[n - 1]. 


The first equality follows from the definition of X [n]. The third equality follows from the 
fact that knowledge of the first (n — 1) Xs is equivalent to knowledge of the first (n — 1) 
Ws. The next-to-last equality follows from E[W|W] = W. The last equality follows from 
the fact that E{W[n]} = 0. 
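The Martingale property of the counting sequence can also be checked by brute force for small $n$. The sketch below enumerates every equally likely $\pm 1$ path $W[0], \ldots, W[n]$, groups the paths by their past $X[0], \ldots, X[n-1]$, and computes the exact conditional mean of $X[n]$ for each past. The function names are illustrative, and enumeration is only practical for small $n$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def conditional_means(n):
    """Return {past: E{X[n] | X[0..n-1] = past}} for X[n] = W[0] + ... + W[n]."""
    sums = defaultdict(Fraction)   # running sum of X[n] for each observed past
    counts = defaultdict(int)      # number of equally likely paths sharing that past
    for w in product((-1, 1), repeat=n + 1):
        x, total = [], 0
        for step in w:
            total += step
            x.append(total)        # x[k] = X[k]
        past = tuple(x[:-1])       # (X[0], ..., X[n-1])
        sums[past] += x[-1]
        counts[past] += 1
    return {past: sums[past] / counts[past] for past in sums}


def is_martingale_step(n):
    """True if E{X[n] | past} equals X[n-1] for every possible past (n >= 1)."""
    return all(mean == past[-1] for past, mean in conditional_means(n).items())


print(is_martingale_step(4))
```

Each past has exactly two continuations, $X[n-1] + 1$ and $X[n-1] - 1$, so the conditional mean collapses to $X[n-1]$, in agreement with the derivation.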


Example 8.8-2 
(independent-increments sequences) Let $X[n]$ be an independent-increments random
sequence (see Definition 8.1-4) defined for $n \ge 0$. Then $X_c[n] \triangleq X[n] - \mu_X[n]$ is a Martingale.
To show this we write $X_c[n] = (X_c[n] - X_c[n-1]) + X_c[n-1]$ and note that by
independent increments and the fact that the mean of $X_c$ is zero, we have

$$\begin{aligned}
E\{X_c[n] \mid X_c[n-1], \ldots, X_c[0]\} &= E\{X_c[n] - X_c[n-1] \mid X_c[n-1], \ldots, X_c[0]\} \\
&\quad + E\{X_c[n-1] \mid X_c[n-1], \ldots, X_c[0]\} \\
&= E\{X_c[n] - X_c[n-1]\} + X_c[n-1] \\
&= X_c[n-1].
\end{aligned}$$

The next theorem shows the connection between the Strong Laws, which have to do
with the convergence of sample sequences, and Martingales. It provides a kind of Chebyshev
inequality for the maximum term in an n-point Martingale sequence.


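Theorem 8.8-3 Let $X[n]$ be a Martingale sequence defined on $n \ge 0$. Then for every
$\varepsilon > 0$ and for any positive $n$,

$$P\Big[\max_{0 \le k \le n} |X[k]| \ge \varepsilon\Big] \le E\{X^2[n]\}/\varepsilon^2.$$

Proof For $0 \le j \le n$, define the mutually exclusive events,

$$A_j \triangleq \{|X[k]| \ge \varepsilon \text{ for the first time at } j\}.$$

Then the event $\{\max_{0 \le k \le n} |X[k]| \ge \varepsilon\}$ is just a union of these events. Also define the
random variables,

$$I_j \triangleq \begin{cases} 1, & \text{if } A_j \text{ occurs}, \\ 0, & \text{otherwise}, \end{cases}$$

called the indicators of the events $A_j$. Then

$$E\{X^2[n]\} \ge \sum_{j=0}^{n} E\{X^2[n] I_j\} \qquad (8.8\text{-}1)$$

since $\sum_{j=0}^{n} I_j \le 1$. Also $X^2[n] = (X[j] + (X[n] - X[j]))^2$, so expanding and inserting into
Equation 8.8-1 we get

$$\begin{aligned}
E\{X^2[n]\} &\ge \sum_{j=0}^{n} E\{X^2[j] I_j\} + 2\sum_{j=0}^{n} E\{X[j](X[n] - X[j]) I_j\} + \sum_{j=0}^{n} E\{(X[n] - X[j])^2 I_j\} \\
&\ge \sum_{j=0}^{n} E\{X^2[j] I_j\} + 2\sum_{j=0}^{n} E\{X[j](X[n] - X[j]) I_j\}. \qquad (8.8\text{-}2)
\end{aligned}$$

Letting $Z_j \triangleq X[j] I_j$, we can write each summand of the second term in Equation 8.8-2 as
$E\{Z_j (X[n] - X[j])\}$, and noting that $Z_j$ depends only on $X[0], \ldots, X[j]$, we then have

$$\begin{aligned}
E\{Z_j (X[n] - X[j])\} &= E\{E[Z_j (X[n] - X[j]) \mid X[0], \ldots, X[j]]\} \\
&= E\{Z_j E[X[n] - X[j] \mid X[0], \ldots, X[j]]\} \\
&= E\{Z_j (X[j] - X[j])\} \\
&= 0.
\end{aligned}$$

Thus Equation 8.8-2 becomes

$$\begin{aligned}
E\{X^2[n]\} &\ge \sum_{j=0}^{n} E\{X^2[j] I_j\} \\
&\ge \varepsilon^2 E\Big\{\sum_{j=0}^{n} I_j\Big\} \\
&= \varepsilon^2 P\Big[\bigcup_{j=0}^{n} A_j\Big] \\
&= \varepsilon^2 P\Big[\max_{0 \le k \le n} |X[k]| \ge \varepsilon\Big].
\end{aligned}$$ ■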
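The inequality of Theorem 8.8-3 can be checked exactly on a small case. The sketch below uses the $\pm 1$ counting sequence of Example 8.8-1 as the Martingale and enumerates all equally likely paths to evaluate both sides. The function name and the choice of test sequence are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def maximal_inequality_sides(n, eps):
    """Exact (P[max_k |X[k]| >= eps], E{X^2[n]}/eps^2) for X[n] = W[0] + ... + W[n]."""
    paths = list(product((-1, 1), repeat=n + 1))   # all equally likely W[0..n]
    hits = 0
    second_moment = Fraction(0)
    for w in paths:
        total, peak = 0, 0
        for step in w:
            total += step
            peak = max(peak, abs(total))           # running max of |X[k]|
        hits += peak >= eps                        # path reaches the level eps
        second_moment += total * total             # accumulates X^2[n]
    lhs = Fraction(hits, len(paths))
    rhs = second_moment / len(paths) / Fraction(eps) ** 2
    return lhs, rhs


lhs, rhs = maximal_inequality_sides(5, 3)
print(lhs, "<=", rhs)
```

For this sequence $E\{X^2[n]\} = n + 1$, so the right-hand side grows linearly in $n$ while the left-hand side is a probability; the bound is informative only when $\varepsilon^2$ is large compared with $n + 1$.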


Theorem 8.8-4 (Martingale Convergence theorem) Let $X[n]$ be a Martingale
sequence on $n \ge 0$, satisfying

$$E\{X^2[n]\} \le C < \infty \quad \text{for all } n \text{ for some } C.$$

Then

$$X[n] \to X \ \text{(a.s.)} \quad \text{as } n \to \infty,$$

where $X$ is the limiting random variable.


Proof Let $m \ge 0$ and define $Y[n] \triangleq X[n+m] - X[m]$ for $n \ge 0$. Then $Y[n]$ is a
Martingale, so by Theorem 8.8-3

$$P\Big[\max_{0 \le k \le n} |X[m+k] - X[m]| \ge \varepsilon\Big] \le \frac{1}{\varepsilon^2} E\{Y^2[n]\},$$

where

$$\begin{aligned}
E\{Y^2[n]\} &= E\{(X[n+m] - X[m])^2\} \\
&= E\{X^2[n+m]\} - 2E\{X[n+m]X[m]\} + E\{X^2[m]\}.
\end{aligned}$$

Rewriting the middle term, we have

$$\begin{aligned}
E\{X[m]X[n+m]\} &= E\{X[m]E[X[n+m] \mid X[m], \ldots, X[0]]\} \\
&= E\{X[m]X[m]\} \\
&= E\{X^2[m]\} \quad \text{since } X \text{ is a Martingale},
\end{aligned}$$

so

$$E\{Y^2[n]\} = E\{X^2[n+m]\} - E\{X^2[m]\} \ge 0 \quad \text{for all } m, n \ge 0. \qquad (8.8\text{-}3)$$

Therefore $E\{X^2[n]\}$ must be monotonic nondecreasing. Since it is bounded from above
by $C < \infty$, it must converge to a limit. Since it has a limit, then by Equation 8.8-3,
$E\{Y^2[n]\} \to 0$ as $m$ and $n \to \infty$. Thus,

$$\lim_{m \to \infty} P\Big[\max_{k \ge 0} |X[m+k] - X[m]| > \varepsilon\Big] = 0,$$

which implies $P[\lim_{m \to \infty} \max_{k \ge 0} |X[m+k] - X[m]| > \varepsilon] = 0$ by the continuity of the
probability measure $P$ (cf. Corollary to Theorem 8.1-1). Finally, by the Cauchy convergence
criterion, there exists a random variable $X$ such that

$$X[n] \to X \ \text{(a.s.)}.$$ ■
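The key algebraic step in the proof, Equation 8.8-3, says that the mean-square increment of a Martingale equals the difference of its second moments. A minimal sketch, again assuming the $\pm 1$ counting sequence of Example 8.8-1 as the test Martingale, verifies this by exact enumeration; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def second_moments(m, n):
    """Exact (E{X^2[m]}, E{X^2[n+m]}, E{Y^2[n]}) with Y[n] = X[n+m] - X[m]."""
    length = n + m + 1                       # W[0], ..., W[n+m]
    paths = list(product((-1, 1), repeat=length))
    ex2_m = ex2_nm = ey2 = Fraction(0)
    for w in paths:
        x_m = sum(w[: m + 1])                # X[m]
        x_nm = sum(w)                        # X[n+m]
        ex2_m += x_m ** 2
        ex2_nm += x_nm ** 2
        ey2 += (x_nm - x_m) ** 2
    count = len(paths)
    return ex2_m / count, ex2_nm / count, ey2 / count


ex2_m, ex2_nm, ey2 = second_moments(2, 3)
print(ex2_m, ex2_nm, ey2)   # ey2 equals ex2_nm - ex2_m
```

Note that this particular sequence has $E\{X^2[n]\} = n + 1$, which is unbounded, so it satisfies Equation 8.8-3 but not the boundedness hypothesis of Theorem 8.8-4.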


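Theorem 8.8-5 (Strong Law of Large Numbers) Let $X[n]$ be a WSS independent
random sequence with mean $\mu_X$ and variance $\sigma_X^2$ defined for $n \ge 1$. Then as $n \to \infty$

$$\hat{\mu}_X[n] = \frac{1}{n} \sum_{k=1}^{n} X[k] \to \mu_X \ \text{(a.s.)}.$$

Proof Let $Y[n] \triangleq \sum_{k=1}^{n} \frac{1}{k} X_c[k]$, where $X_c[k] \triangleq X[k] - \mu_X$; then $Y[n]$ is a Martingale on $n \ge 1$. Since

$$E\{Y^2[n]\} = \sum_{k=1}^{n} \frac{1}{k^2} \sigma_X^2 \le \sigma_X^2 \sum_{k=1}^{\infty} \frac{1}{k^2} = C < \infty,$$

we can apply Theorem 8.8-4 to show that $Y[n] \to Y$ (a.s.) for some random variable $Y$.
Next, noting that $X_c[k] = k(Y[k] - Y[k-1])$ with $Y[0] \triangleq 0$, we can write

$$\begin{aligned}
\frac{1}{n} \sum_{k=1}^{n} X_c[k] &= \frac{1}{n} \sum_{k=1}^{n} kY[k] - \frac{1}{n} \sum_{k=1}^{n} kY[k-1] \\
&= -\frac{1}{n} \sum_{k=1}^{n} Y[k] + \frac{n+1}{n} Y[n] \\
&\to -Y + Y = 0 \ \text{(a.s.)}
\end{aligned}$$

so that

$$\hat{\mu}_X[n] \to \mu_X \ \text{(a.s.)}.$$ ■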
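The summation-by-parts step in the proof above is easy to get wrong by one index, so it may help to check it numerically. The following sketch takes an arbitrary list of centered values $X_c[1], \ldots, X_c[n]$ (the integers used here are made up for illustration), builds $Y[k]$ with exact rational arithmetic, and recovers the sample mean from the $Y[k]$ alone.

```python
from fractions import Fraction


def partial_sums_Y(xc):
    """Return [Y[1], ..., Y[n]] with Y[n] = sum_{k=1}^{n} X_c[k] / k."""
    y, acc = [], Fraction(0)
    for k, value in enumerate(xc, start=1):
        acc += Fraction(value, k)
        y.append(acc)
    return y


def sample_mean_via_Y(xc):
    """(1/n) sum X_c[k] written as -(1/n) sum_{k=1}^{n} Y[k] + ((n+1)/n) Y[n]."""
    n = len(xc)
    y = partial_sums_Y(xc)
    return -sum(y) / n + Fraction(n + 1, n) * y[-1]


xc = [3, -1, 4, -1, -5, 9, -2]   # hypothetical centered samples X_c[1..7]
print(sample_mean_via_Y(xc), Fraction(sum(xc), len(xc)))
```

Since $Y[n] \to Y$ (a.s.), the Cesàro average $(1/n)\sum_{k=1}^{n} Y[k]$ converges to the same limit, which is why the two terms cancel in the limit.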


SUMMARY 


In this chapter we introduced the concept of a random sequence and studied its properties 
and ways to characterize it. We defined the random sequence as a family of sample sequences 
each associated with an outcome or point in the sample space. We introduced several impor- 
tant random sequences. Then we reviewed linear discrete-time theory and considered the 
practical problem of finding out how sample sequences are modified as they pass through 
the system. Our emphasis was on how the mean and covariance function are transformed 
by a linear system. We then considered the special but important case of stationary and 
WSS random sequences and introduced the concept of power spectral density for them. We 
looked at convergence of random sequences and learned to appreciate the variety of modes 
of convergence that are possible. We then applied some of these results to the laws of large 
numbers and used Martingale properties to prove the important strong law of large numbers. 

Some additional sources for the material in this chapter are [8-9], [8-10], and [8-11].

In the next chapter we will discover that many of these results extend to the case of 
continuous time as we continue our study with random processes. 


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional 
reading.) 
8.1 Consider the following autoregressive process.

$$W_n = 2W_{n-1} + X_n, \quad W_0 = 0$$

$$Z_n = \tfrac{1}{2} Z_{n-1} + X_n, \quad Z_0 = 0$$

Express $W_n$ and $Z_n$ in terms of $X_n, X_{n-1}, \ldots, X_1$ and then find $E[W_n]$ and $E[Z_n]$.
*8.2 Consider an N-dimensional random vector $\mathbf{X}$. Show that pairwise independence of
its random variable components does not imply that the components are jointly
independent.

8.3 Let $\mathbf{X} = (X_1, X_2, \ldots, X_5)^T$ be a random vector whose components satisfy the
equations

$$X_i = X_{i-1} + B_i, \quad 1 \le i \le 5,$$

where the $B_i$ are jointly independent and Bernoulli distributed, taking on values 0
and 1, with mean value 1/2. The first value is $X_1 = B_1$. Put the $B_i$ together to
make a random vector $\mathbf{B}$.

(a) Write $\mathbf{X} = \mathbf{A}\mathbf{B}$ for some constant matrix $\mathbf{A}$ and determine $\mathbf{A}$.
(b) Find the mean vector $\boldsymbol{\mu}_X$.
(c) Find the covariance matrix $\mathbf{K}_{BB}$.
(d) Find the covariance matrix $\mathbf{K}_{XX}$.

[For parts (b) through (d), express your answers in terms of the matrix $\mathbf{A}$.]

8.4 Let a collection of sequences $x(n, \theta_k)$ be given in terms of a deterministic parameter
$\theta_k$ as

$$\Big\{\cos\Big(\frac{2\pi n}{5} + \theta_k\Big)\Big\}_{k=0}^{N-1}.$$

Now define a random variable $\Theta$ taking on values from the same parameter set $\{\theta_k\}$.
Let the PMF of $\Theta$ be given as

$$P_\Theta(\theta_k) = \frac{1}{N} \quad \text{for } k = 0, \ldots, N-1.$$

Now set $X[n] \triangleq \cos\big(\frac{2\pi n}{5} + \Theta\big)$.

(a) Is $X[n]$ a random sequence? If so, describe both the mapping $X(n, \zeta)$ and
its probability space $(\Omega, \mathcal{F}, P)$. If not, explain fully.
(b) Let $\theta_k = 2\pi k/N$ for $k = 0, \ldots, N-1$, and find $E\{X[n]\}$.†
(c) For the same $\theta_k$ as in part (b), find $E\{X[n]X[m]\}$. Take $N > 2$ here.

†Note: $\cos(A + B) = \cos A \cos B - \sin A \sin B$ and $\cos A \cos B = \tfrac{1}{2}\{\cos(A + B) + \cos(A - B)\}$.

*8.5 Often one is given a problem statement starting as follows: "Let $X$ be a real-valued
random variable with pdf $f_X(x)$ ...." Since an RV is a mapping from a sample space
$\Omega$ with field of events $\mathcal{F}$ and a probability measure $P$, evidently the existence of
an underlying probability space $(\Omega, \mathcal{F}, P)$ is assumed by such a problem statement.
Show that a suitable underlying probability space $(\Omega, \mathcal{F}, P)$ can always be created,
thus legitimizing problem statements such as the one above.

8.6 Let $T$ be a continuous random variable denoting the time at which the first photon
is emitted from a light source; $T$ is measured from the instant the source is energized.
Assume that the probability density function for $T$ is $f_T(t) = \lambda e^{-\lambda t} u(t)$ with
$\lambda > 0$.

(a) What is the probability that at least one photon is emitted prior to time $t_2$
if it is known that none was emitted prior to time $t_1$, where $t_1 < t_2$?
(b) What is the probability that at least one photon is emitted prior to time $t_2$
if three independent sources of this type are energized simultaneously?

8.7 Let $Z_1, Z_2, \ldots$ be independent and identically distributed random variables with
$P(Z_n = 1) = p$ and $P(Z_n = -1) = 1 - p$, $\forall n$.
Let $X_n = \sum_{i=1}^{n} Z_i$, $n = 1, 2, \ldots$ and $X_0 = 0$. $\{X_n\}$ is called a simple random walk in
one dimension.

(a) Compute the first-order pmf of $X_n$.
(b) Find $P(X_n = -2)$ after 4 steps.

8.8 Let $X$ and $Y$ be i.i.d. random variables with the exponential probability density
functions

$$f_X(w) = f_Y(w) = \lambda e^{-\lambda w} u(w).$$

(a) Determine the probability density function for the ratio

$$0 \le R \triangleq \frac{X}{X + Y} \le 1, \quad \text{that is, } f_R(r), \ 0 \le r \le 1.$$

(b) Let $A$ be the event $X < 1/Y$. Determine the conditional pdf of $X$ given that
$A$ occurs and that $Y = y$; that is, determine

$$f_X(x \mid A, Y = y).$$

(c) Using the definitions of (b), what is the minimum mean-square error estimate
of $X$ given that the event $A$ occurs and that $Y = y$?

8.9 Use the Schwarz inequality for complex random variables to prove that

$$|R_X[m]| \le R_X[0], \quad \text{for all integers } m$$

for any complex-valued WSS random sequence $X[m]$.

8.10 Let $\mathbf{X} = (X_1, X_2, \ldots, X_{10})^T$ be a random vector whose components satisfy the
equations,

$$X_i = \frac{2}{5}(X_{i-1} + X_{i+1}) + W_i, \quad \text{for } 2 \le i \le 9,$$

where the $W_i$ are independent and Laplacian distributed with mean zero and variance
$\sigma^2$ for $i = 1$ to 10, and $X_1 = \frac{1}{2}X_2 + \frac{5}{4}W_1$ and $X_{10} = \frac{1}{2}X_9 + \frac{5}{4}W_{10}$.

(a) Find the mean vector $\boldsymbol{\mu}_X$.
(b) Find the covariance matrix $\mathbf{K}_{XX}$.
(c) Write an expression for the multidimensional pdf of the random vector $\mathbf{X}$.

[Hint: Matrix identity: if

$$\mathbf{A} = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \cdots & \\ \rho^2 & \rho & 1 & \cdots & \\ \vdots & & & \ddots & \rho \\ \rho^{N-1} & \cdots & & \rho & 1 \end{pmatrix},$$

then $\mathbf{A}^{-1}$ is given as

$$\beta^2 \mathbf{A}^{-1} = \begin{pmatrix} 1 - \rho\alpha & -\alpha & 0 & \cdots & 0 \\ -\alpha & 1 & -\alpha & 0 & \cdots \\ 0 & -\alpha & 1 & \ddots & 0 \\ \vdots & 0 & \ddots & \ddots & -\alpha \\ 0 & \cdots & 0 & -\alpha & 1 - \rho\alpha \end{pmatrix}$$

with $\alpha \triangleq \frac{\rho}{1 + \rho^2}$ and $\beta^2 \triangleq \frac{1 - \rho^2}{1 + \rho^2}$. The Laplacian pdf is given as

$$f_W(w) = \frac{1}{\sqrt{2}\,\sigma} \exp\Big(-\frac{\sqrt{2}\,|w|}{\sigma}\Big), \quad -\infty < w < +\infty.]$$
8.11 Prove Corollary 8.1-1. 
8.12 Let $\{X_i\}$ be a sequence of i.i.d. Normal random variables with zero mean and unit
variance. Let

$$S_k \triangleq X_1 + X_2 + \cdots + X_k \quad \text{for } k \ge 1.$$

Determine the joint probability density function for $S_n$ and $S_m$, where $1 \le m < n$.
8.13 In Example 8.1-8 we saw that CDFs are continuous from the right. Are they contin- 
uous from the left also? Either prove or give a counterexample. 
8.14 Let the probability space $(\Omega, \mathcal{F}, P)$ be given as follows:

$\Omega = \{a, b, c\}$, that is, the outcome $\zeta = a$ or $b$ or $c$,
$\mathcal{F}$ = all subsets of $\Omega$,
$P[\{\zeta\}] = 1/3$ for each outcome $\zeta$.

Let the random sequence $X[n]$ be defined as follows:

$X[n, a] = 3\delta[n]$
$X[n, b] = u[n - 1]$
$X[n, c] = \cos(\pi n/2)$.

(a) Find the mean function $\mu_X[n]$.
(b) Find the correlation function $R_{XX}[m, n]$.
(c) Are $X[1]$ and $X[0]$ independent? Why?

8.15 Let $X_n$ be an independently and identically distributed sequence of Gaussian random
variables with zero mean and variance $\sigma^2$. Let $Y_n$ be the average of the two consecutive
values of $X_n$:

$$Y_n = \frac{X_n + X_{n-1}}{2}.$$

Determine whether $\{Y_n\}$ is a wide-sense stationary process.

8.16 Consider a random sequence $X[n]$ as the input to a linear filter with impulse response

$$h[n] = \begin{cases} 1/2, & n = 0 \\ 1/2, & n = 1 \\ 0, & \text{else.} \end{cases}$$

We denote the output random sequence $Y[n]$, that is, for each outcome $\zeta$,

$$Y[n, \zeta] = \sum_{k=-\infty}^{+\infty} h[k] X[n - k, \zeta].$$

Assume the filter runs for all time, $-\infty < n < +\infty$. We are given the mean function
of the input $\mu_X[n]$ and correlation function of the input $R_{XX}[n_1, n_2]$. Express your
answers in terms of these assumed known functions.

(a) Find the mean function of the output $\mu_Y[n]$.
(b) Find the autocorrelation function of the output $R_{YY}[n_1, n_2]$.
(c) Write the autocovariance function of the output $K_{YY}[n_1, n_2]$ in terms of your
answers to parts (a) and (b).
(d) Now assume that the input $X[n]$ is a Gaussian random sequence, and write
the corresponding joint pdf of the output $f_Y(y_1, y_2; n_1, n_2)$ at two arbitrary
times $n_1 \ne n_2$ in terms of $\mu_Y[n]$ and $K_{YY}[n_1, n_2]$.

8.17 Let $I_n$ be a sequence of independent Bernoulli random variables. $I_n$ is then an
independent and identically distributed random sequence taking on values from the
set $\{0, 1\}$.

(a) Let $D_n = 2I_n - 1$. Then

$$D_n = \begin{cases} +1, & I_n = 1 \\ -1, & I_n = 0. \end{cases}$$

Determine the mean and variance of $D_n$.
(b) Let $S_n = \sum_{k=1}^{n} I_k$. Obtain the first-order pmf of $S_n$ and its mean and variance.

8.18 Let $T[n]$ denote the random arrival sequence studied in class,

$$T[n] = \sum_{k=1}^{n} \tau[k],$$

where the $\tau[k]$ are an independent random sequence of interarrival times, distributed
as exponential with parameter $\lambda > 0$.

(a) Find the CF of this random sequence, that is,
$$\Phi_T(\omega; n) = E\{e^{+j\omega T[n]}\}.$$
(b) Use this CF to find the mean function $\mu_T[n]$.

8.19 Let the random sequence $T[n]$ be defined on $n \ge 1$ and for each $n$, have an Erlang
pdf:

$$f_T(t; n) = \frac{(\lambda t)^{n-1}}{(n-1)!} \lambda e^{-\lambda t} u(t), \quad \lambda > 0.$$

Define the new random sequence $\tau[n] \triangleq T[n] - T[n-1]$ for $n \ge 2$, and set $\tau[1] \triangleq T[1]$.
Can we conclude that $\tau[n]$ is exponential with the same parameter $\lambda$? If not, what
additional information on the random sequence $T[n]$ is needed? Justify your answer.

8.20 This problem considers a random sequence model for a charge coupled device (CCD)
array with very "leaky" cells. We start by defining the width-3 pulse function:

$$h[n] \triangleq \begin{cases} 1/4, & n = -1 \\ 1/2, & n = 0 \\ 1/4, & n = +1 \\ 0, & \text{else,} \end{cases}$$

and as illustrated in Figure P8.20, which we will use to account for 25 percent of
the charge in a cell that leaks out to its right neighbor and 25 percent that leaks to
its left neighbor. We assume that the one-dimensional CCD array is infinitely long
and represent the array contents by the random sequence $X$:

$$X[n, \zeta] = \sum_{i=-\infty}^{+\infty} A(\zeta_i) h[n - i],$$

Figure P8.20 Pulse function of leaky cell.

where $\zeta_i$ is the $i$th component of $\zeta$, the infinite dimensional outcome of the experiment.
The random variables $A(\zeta_i)$ are jointly independent and Gaussian distributed
with mean $\lambda$ and variance $\lambda$.

(a) Find the mean function $\mu_X[n]$.
(b) Find the first-order pdf $f_X(x; n)$.
(c) Find the joint pdf $f_X(x_1, x_2; n, n+1)$.


8.21 We are given a random sequence $X[n]$ for $n \ge 0$ with conditional pdf's

$$f_X(x_n \mid x_{n-1}) = \alpha \exp[-\alpha(x_n - x_{n-1})]\, u(x_n - x_{n-1}) \quad \text{for } n \ge 1,$$

with $u(x)$ the unit-step function and initial pdf $f_X(x_0) = \delta(x_0)$. Take $\alpha > 0$.

(a) Find the first-order pdf $f_X(x_n)$ for $n = 2$.
(b) Find the first-order pdf $f_X(x_n)$ for arbitrary $n > 1$ using mathematical induction.


8.22 Let $x[n]$ be a deterministic input to the LSI discrete-time system $H$ shown in
Figure P8.23.

(a) Use linearity and shift-invariance properties to show that

$$y[n] = x[n] * h[n] \triangleq \sum_{k=-\infty}^{+\infty} x[k] h[n - k] = h[n] * x[n].$$

(b) Define the Fourier transform of a sequence $a[n]$ as

$$A(\omega) \triangleq \sum_{n=-\infty}^{+\infty} a[n] e^{-j\omega n}, \quad -\pi \le \omega \le +\pi,$$

and show that the inverse Fourier transform is

$$a[n] = \frac{1}{2\pi} \int_{-\pi}^{+\pi} A(\omega) e^{+j\omega n}\, d\omega, \quad -\infty < n < +\infty.$$

(c) Using the results in (a) and (b), show that

$$Y(\omega) = H(\omega) X(\omega), \quad -\pi \le \omega \le +\pi,$$

for an LSI discrete-time system.

8.23 Consider the difference equation

$$y[n] + \alpha y[n - 1] = x[n], \quad -\infty < n < +\infty,$$

where $-1 < \alpha < +1$.

(a) Let the input be $x[n] = \beta^n u[n]$ for $-1 < \beta < +1$. Find the solution for $y[n]$
assuming causality applies, that is, $y[n] = 0$ for $n < 0$.

Figure P8.23 LSI system with impulse response $h[n]$.

(b) Let the input be $x[n] = \beta^{-n} u[-n]$ for $-1 < \beta < +1$. Find the solution for
$y[n]$ assuming anticausality applies, that is, $y[n] = 0$ for $n > 0$.


8.24 Let $X[n]$ be a WSS random sequence with mean zero and covariance function

$$K_{XX}[m] = \sigma^2 \rho^{|m|} \quad \text{for all } -\infty < m < +\infty,$$

where $\rho$ is a real constant. Consider difference equations of the form

$$Y[n] = X[n] - \alpha X[n - 1] \quad \text{with } -\infty < n < +\infty.$$

(a) Write the covariance function of $Y[n]$ in terms of the parameters $\sigma^2$, $\rho$, and $\alpha$.
(b) Find a value of $\alpha$ such that $Y[n]$ is a WSS white noise sequence.
(c) What is the average power of this white noise?


8.25 Let $W[n]$ be an independent random sequence with mean 0 and variance $\sigma_W^2$, defined
for $-\infty < n < +\infty$. For appropriately chosen $\rho$, let the stationary random sequence
$X[n]$ satisfy the causal LCCDE

$$X[n] = \rho X[n - 1] + W[n], \quad -\infty < n < +\infty.$$

(a) Show that $X[n - 1]$ and $W[n]$ are independent at time $n$.
(b) Derive the characteristic function equation

$$\Phi_X(\omega) = \Phi_X(\rho\omega)\, \Phi_W(\omega).$$

(c) Find the continuous solution to this functional equation for the unknown
function $\Phi_X$ when $W[n]$ is assumed to be Gaussian. [Note: $\Phi_X(0) = 1$.]
(d) What is $\sigma_X^2$?


†This part requires more detailed knowledge of the z-transform. (cf. Appendix A.)
8.26 Consider the LSI system shown in Figure P8.26, whose deterministic input $x[n]$
is contaminated by noise (a random sequence) $W[n]$. We wish to determine the
properties of the output random sequence $Y[n]$. The noise $W[n]$ has mean $\mu_W[n] = 2$
and autocorrelation $E\{W[m]W[n]\} = \sigma_W^2 \delta[m - n] + 4$. The impulse response is
$h[n] = \rho^n u[n]$ with $|\rho| < 1$. The deterministic input $x[n]$ is given as $x[n] = 3$ for
all $n$.

(a) Find the output mean $\mu_Y[n]$.
(b) Find the output power $E\{Y^2[n]\}$.
(c) Find the output covariance $K_{YY}[m, n]$.

Figure P8.26 LSI system with deterministic-plus-noise input.

8.27 Show that the random sequence $X[n]$ generated in Example 8.1-15 is not an
independent random sequence.

8.28 The impulse response of a discrete linear time-invariant system is given by $h[n] =
a^n u[n]$ where $|a| < 1$, and $u[n]$ is the unit step sequence defined by

$$u[n] = \begin{cases} 1, & n \ge 0 \\ 0, & n < 0. \end{cases}$$

If the input sequence $X[n]$ is a discrete-time white noise with power spectral density
$N_0/2$, find the power spectral density of the output $Y[n]$.

8.29 Let $X_n$ consist of two interleaved sequences of independent random variables. For
$n$ even, $X_n$ assumes the values $\pm 1$ with probability $\frac{1}{2}$; for $n$ odd, $X_n$ assumes the
values $\frac{1}{3}$ and $-3$ with probabilities $\frac{9}{10}$ and $\frac{1}{10}$, respectively. Verify whether

(a) $\{X_n\}$ is WSS.
(b) $\{X_n\}$ is stationary.

8.30 Consider a WSS random sequence $X[n]$ with mean $\mu_X[n] = \mu$, a constant, and
correlation function $R_{XX}[m] = p^2 \delta[m]$ with $p^2 > 0$. In such a case $\mu$ must be
zero, as you will show in this problem. Note that the covariance function here is
$K_{XX}[m] = p^2 \delta[m] - \mu^2$.

(a) Take $m = 0$ and conclude that $p^2 \ge \mu^2$.
(b) Take a vector $\mathbf{X}$ of length $N$ out of the random sequence $X[n]$. Show that
the corresponding covariance matrix $\mathbf{K}_{XX}$ will be positive semidefinite only
if $\mu^2 \le \sigma^2/(N - 1)$, where $\sigma^2 \triangleq p^2 - \mu^2$. (Hint: Take coefficient vector $\mathbf{a} = \mathbf{1}$,
i.e., all 1's.)
(c) Let $N \to \infty$ and conclude that $\mu$ must be zero for the stationary white noise
sequence $X[n]$.


8.31 A discrete-time system is given by

$$Y[n] = aY[n - 1] + X[n] \quad \text{where } |a| < 1.$$

The input $X[n]$ is discrete-time white noise with average power $\sigma^2$. The impulse response
$h[n]$ of the system is defined by $h[n] = a h[n - 1] + \delta[n]$. Find the spectral density and
average power of the output $Y[n]$.

8.32 Let the WSS random sequence $X$ have correlation function

$$R_{XX}[m] = 10 e^{-\lambda_1 |m|} + 5 e^{-\lambda_2 |m|}$$

with $\lambda_1 > 0$ and $\lambda_2 > 0$. Find the corresponding psd $S_{XX}(\omega)$ for $|\omega| \le \pi$.

8.33 The psd of a certain random sequence is given as $S_{XX}(\omega) = 1/[(1 + \alpha^2) - 2\alpha \cos \omega]^2$
for $-\pi \le \omega \le +\pi$, where $|\alpha| < 1$. Find the random sequence's correlation
function $R_{XX}[m]$.

8.34 Let the input to system $H(\omega)$ be $W[n]$, a white noise random sequence with $\mu_W[n] =
0$ and $K_{WW}[m] = \delta[m]$. Let $X[n]$ denote the corresponding output random sequence.
Find $K_{XW}[m]$ and $S_{XW}(\omega)$.

8.35 Consider the system shown in Figure P8.35. Let $X[n]$ and $V[n]$ be WSS and mutually
uncorrelated with zero mean and psd's $S_{XX}(\omega)$ and $S_{VV}(\omega)$, respectively.

Figure P8.35 LSI system with random signal-plus-noise input.

(a) Find the psd of the output $S_{YY}(\omega)$.
(b) Find the cross-power spectral density between the input $X$ and the output $Y$,
that is, find $S_{XY}(\omega)$.


8.36 Consider the discrete-time system with input random sequence $X[n]$ and output
$Y[n]$ given as


Assume that the input sequence $X[n]$ is WSS with psd $S_{XX}(\omega) = 2$.

(a) Find the psd of the output random sequence $S_{YY}(\omega)$.
(b) Find the output correlation function $R_{YY}[m]$.


8.37 Let the stationary random sequence $Y[n] = X[n] + U[n]$ with power spectral density
(psd) $S_Y(\omega)$ be our model of signal $X$ plus noise $U$ for a certain discrete-time
channel. Assume that $X$ and $U$ are orthogonal and also assume that we have
$S_Y(\omega) > 0$ for all $|\omega| \le \pi$. As a first step in processing $Y$ to find an estimate for
$X$, let $Y$ be input to a discrete-time filter $G(\omega)$ defined as $G(\omega) = 1/\sqrt{S_Y(\omega)}$ to
produce the stationary output sequence $W[n]$ as shown in Figure P8.37a.

Figure P8.37a

(a) Find the psd of $W[n]$, that is, $S_W(\omega)$, and also the cross-power spectral density
between original input and output $S_{XW}(\omega)$, in terms of $S_X$, $S_U$, and $S_Y$.

(b) Next filter $W[n]$ with an FIR impulse response $h[n]$, $n = 0, \ldots, N - 1$, to give
output $\hat{X}[n]$, an estimate of the original noise-free signal $X[n]$ as shown in
Figure P8.37b. In line with the Hilbert space theory of random variables, we
decide to choose the filter coefficients $h[n]$ so that the estimate error $\hat{X}[n] -
X[n]$ will be orthogonal to all those $W[n]$ actually used in making the estimate
at time $n$. Write down the resulting equations for the $N$ filter coefficients
$h[0], h[1], \ldots, h[N - 1]$. Your answer should be in terms of the cross-correlation
function $R_{XW}[m]$.

Figure P8.37b

(c) Let $N$ go to infinity, and write the frequency response of $h[n]$, that is, $H(\omega)$,
in terms of the discrete-time power spectral densities $S_{XX}(\omega)$ and $S_{YY}(\omega)$.


8.38 Higher than second-order moments have proved useful in certain advanced applications.
Here we consider a third-order correlation function of a stationary random
sequence

$$R_X[m_1, m_2] \triangleq E\{X[n + m_1] X[n + m_2] X^*[n]\}$$

defined for the random sequence $X[n]$, $-\infty < n < +\infty$.

(a) Let $Y[n]$ be the output from an LSI system with impulse response $h[n]$, due to the
input random sequence $X[n]$. Determine a convolution-like equation expressing
the third-order correlation function of the output $R_Y[m_1, m_2]$ in terms of the
third-order correlation function of the input $R_X[m_1, m_2]$ and the system impulse
response $h[n]$.

(b) Define the bi-spectral density of $X$ as the two-dimensional Fourier transform

$$S_X(\omega_1, \omega_2) \triangleq \sum_{m_1} \sum_{m_2} R_X[m_1, m_2] \exp\{-j(\omega_1 m_1 + \omega_2 m_2)\}.$$

For the system of part (a), find an expression for the bi-spectral density of the
output $S_Y(\omega_1, \omega_2)$ in terms of the system frequency response $H(\cdot)$ and the
bi-spectral density of the input $S_X(\omega_1, \omega_2)$.

8.39 Let $X[n]$ be a Markov chain on $n \ge 0$ taking values 1 and 2 with one-step transition
probabilities,

$$p_{ij} \triangleq P\{X[n] = j \mid X[n - 1] = i\}, \quad 1 \le i, j \le 2,$$

given in matrix form as

$$\mathbf{P} = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix} = (p_{ij}).$$

We describe the state probabilities at time $n$ by the vector

$$\mathbf{p}[n] \triangleq [P\{X[n] = 1\}, P\{X[n] = 2\}].$$

(a) Show that $\mathbf{p}[n] = \mathbf{p}[0]\mathbf{P}^n$.
(b) Draw a two-state transition diagram and label the branches with the one-step
transition probabilities $p_{ij}$. Don't forget the $p_{ii}$ or self-transitions. (See
Figure 8.5-1 for state-transition diagram of a Markov chain.)
(c) Given that $X[0] = 1$, find the probability that the first transition to state 2
occurs at time $n$.

8.40 Consider using a first-order Markov sequence to model a random sequence $X[n]$ as

$$X[n] = rX[n - 1] + Z[n],$$

where $Z[n]$ is white noise of variance $\sigma_Z^2$. Thus, we can look at $X[n]$ as the output
of passing $Z[n]$ through a linear system. Take $|r| < 1$ and assume the system has
been running for a long time, that is, $-\infty < n < +\infty$.

(a) Find the psd of $X[n]$, that is, $S_{XX}(\omega)$.
(b) Find the correlation function $R_{XX}[m]$.

8.41 We defined a Markov random sequence $X[n]$ in this chapter as being specified by its
first-order pdf $f_X(x; n)$ and its one-step conditional pdf

$$f_X(x_n \mid x_{n-1}; n, n - 1) = f_X(x_n \mid x_{n-1}) \quad \text{for short.}$$
(a) Find the two-step pdf for a Markov random sequence $f_X(x_n \mid x_{n-2})$ in terms
of the above functions. Here, take $n \ge 2$ for a random sequence starting
at $n = 0$.
(b) Find the N-step pdf $f_X(x_n \mid x_{n-N})$ for arbitrary positive integer $N$, where we
only need consider $n \ge N$.

8.42 Consider a generalized random walk sequence $X[n]$ running on $\{n \ge 0\}$ and defined
as follows:

$$X[0] \triangleq 0,$$
$$X[n] \triangleq \sum_{k=1}^{n} W[k], \quad n > 0,$$

where $W[n]$ is an independent random sequence, stationary, and taking values below
with the indicated probabilities,

$$W[n] \triangleq \begin{cases} +s_1, & p = 1/2, \\ -s_2, & p = 1/2. \end{cases}$$

We see the difference is that the positive and negative step sizes are not the same:
$s_1 \ne s_2$, $s_1 > 0$ and $s_2 > 0$.

(a) Find the mean function $\mu_X[n] \triangleq E\{X[n]\}$.
(b) Find the autocorrelation function $R_{XX}[n_1, n_2] \triangleq E\{X[n_1] X[n_2]\}$.

8.43 Consider a Markov random sequence $X[n]$ running on $1 \le n \le 100$. It is statistically
described by its first-order pdf $f_X(x; 1)$ and its one-step transition (conditional) pdf
$f_X(x_n \mid x_{n-1}; n, n - 1)$. By the Markov definition, we have (suppressing the time
variables) that

$$f_X(x_n \mid x_{n-1}) = f_X(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1) \quad \text{for } 2 \le n \le 100.$$

Show that a Markov random sequence is also Markov in the reverse order, that is,

$$f_X(x_n \mid x_{n+1}) = f_X(x_n \mid x_{n+1}, x_{n+2}, \ldots, x_{100}) \quad \text{for } 1 \le n \le 99,$$

and so one can alternatively statistically describe a Markov random sequence by the
one-step backward pdf $f_X(x_{n-1} \mid x_n; n - 1, n)$ and first-order pdf $f_X(x; 100)$.

8.44 Suppose that the probability of a sunny day (state 0) following a rainy day (state 1)
is $\frac{1}{3}$, and that the probability of a rainy day following a sunny day is $\frac{1}{2}$. Write the
2-state transition probability matrix. Given that May 1 is a sunny day, determine
the probability that May 3 is a sunny day and May 5 is a sunny day.

8.45 Consider the Markov random sequence $X[n]$ generated by the difference equation,
for $n \ge 1$,

$$X[n] = \alpha X[n - 1] + \beta W[n],$$

where the input $W[n]$ is an independent random sequence with zero mean and variance
$\sigma_W^2$, the initial value $X[0] = 0$, and the parameters $\alpha$ and $\beta$ are known constants.
(a) Show that the subsequence $Y[n] \triangleq X[2n]$ is Markov also.
(b) Find the variance function $\sigma_Y^2[n] \triangleq E[|Y[n] - \mu_Y[n]|^2]$ for $n > 0$.

*8.46 Write a MATLAB function called triplemarkov that will compute and plot the
autocorrelation functions for the asymmetric, two-state Markov model in Example 8.1-16
for any three sets of parameters $\{p_{00}, p_{11}\}$. Denote the maximum lag interval as $N$.
Run your routine for $\{0.2, 0.8\}$, $\{0.2, 0.5\}$, and $\{0.2, 0.2\}$. Repeat for $\{0.8, 0.2\}$,
$\{0.8, 0.5\}$, and $\{0.8, 0.8\}$. Describe what you observe.

8.47 Consider the probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = [0, 1]$, $\mathcal{F}$ defined to be the Borel
sets of $\Omega$, and $P[(0, \zeta]] = \zeta$ for $0 < \zeta \le 1$.

(a) Show that $P[\{0\}] = 0$ by using the axioms of probability.
(b) Determine in what senses the following random sequences converge:
(i) $X[n, \zeta] = e^{-n\zeta}$, $n \ge 0$
(ii) $X[n, \zeta] = \sin\big(\zeta + \frac{1}{n}\big)$, $n \ge 1$
(iii) $X[n, \zeta] = \cos^n(\zeta)$, $n \ge 0$.
(c) If the preceding sequences converge, what are the limits?

8.48 The members of the sequence of jointly independent random variables $X[n]$ have
pdf's of the form

$$f_X(x; n) = \Big(1 - \frac{1}{n}\Big) \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{1}{2\sigma^2}\Big(x - \frac{n-1}{n}\sigma\Big)^2\Big] + \frac{1}{n}\sigma \exp(-\sigma x) u(x).$$

Determine whether or not the random sequence $X[n]$ converges in
(i) the mean-square sense,
(ii) probability,
(iii) distribution.

8.49 The members of the random sequence $X[n]$ have joint pdf's of the form

$$f_X(\alpha, \beta; m, n) = \frac{mn}{2\pi\sqrt{1 - \rho^2}} \exp\Big(-\frac{1}{2(1 - \rho^2)}\big[m^2\alpha^2 - 2\rho mn\alpha\beta + n^2\beta^2\big]\Big)$$

for $m \ge 1$ and $n \ge 1$ where $-1 < \rho < +1$.

(a) Show that $X[n]$ converges in the mean-square sense as $n \to \infty$ for all $-1 <
\rho < +1$.
(b) Specify the CDF of the mean-square limit $X \triangleq \lim_{n \to \infty} X[n]$.

8.50 State conditions under which the mean-square limit of a sequence of Gaussian
random variables is also Gaussian.

8.51 Let $X[n]$ be a real-valued random sequence on $n \ge 0$, made up from stationary and
independent increments, that is, $X[n] - X[n - 1] = W[n]$, "the increment," with
$W[n]$ being a stationary and independent random sequence. The random sequence
always starts with $X[0] = 0$. We also know that at time $n = 1$, $E\{X[1]\} = \eta$ and
$\text{Var}\{X[1]\} = \sigma^2$.

(a) Find $\mu_X[n]$ and $\sigma_X^2[n]$, the mean and variance functions of the random
sequence $X$ at time $n$ for any time $n > 1$.
(b) Prove that $X[n]/n$ converges in probability to $\eta$ as the time $n$ approaches
infinity.
8.52 This problem demonstrates that p-convergence implies convergence in distribution
even when the limiting pdf does not exist.

(a) For any real number $x$ and any positive $\varepsilon$, show that

$$P[X \le x - \varepsilon] \le P[X[n] \le x] + P[|X[n] - X| \ge \varepsilon].$$

(b) Similarly show that

$$P[X > x + \varepsilon] \le P[X[n] > x] + P[|X[n] - X| \ge \varepsilon].$$

For part (c), assume the random sequence $X[n]$ converges to the random
variable $X$ in probability.

(c) Let $n \to \infty$ and conclude that

$$\lim_{n \to \infty} F_X(x; n) = F_X(x)$$

at points of continuity of $F_X$.


8.53 Let $X[n]$ be a second-order random sequence. Let $h[n]$ be the impulse response of
an LSI system. We wish to define the output of the system $Y[n]$ as a mean-square
limit.

(a) Show that we can define the mean-square limit

$$Y[n] \triangleq \sum_{k=-\infty}^{+\infty} h[k] X[n - k], \quad -\infty < n < +\infty, \quad \text{(m.s.)}$$

if

$$\sum_{k} \sum_{l} h[k] h^*[l] R_{XX}[n - k, n - l] < \infty \quad \text{for all } n.$$

(Hint: Set $Y_N[n] \triangleq \sum_{k=-N}^{+N} h[k] X[n - k]$ and show that the m.s. limit of $Y_N[n]$
exists by using the Cauchy convergence criterion.)

(b) Find a simpler condition for the case when $X[n]$ is a wide-sense stationary
random sequence.

(c) Find the necessary condition on $h[n]$ when $X[n]$ is (stationary) white noise.


8.54 If $X[n]$ is a Martingale sequence on $n \ge 0$, show that

$$E\{X[n + m] \mid X[m], \ldots, X[0]\} = X[m] \quad \text{for all } n \ge 0.$$

8.55 Let $Y[n]$ be a random sequence and $X$ a random variable and consider the conditional
expectation

$$E\{X \mid Y[0], \ldots, Y[n]\} \triangleq G[n].$$

Show that the random sequence $G[n]$ is a Martingale.

8.56 We can enlarge the concept of Martingale sequence somewhat as follows. Let $G[n] \triangleq
g(X[0], \ldots, X[n])$ for each $n \ge 0$ for measurable functions $g$. We say $G$ is a Martingale
with respect to $X$ if $E\{G[n] \mid X[0], \ldots, X[n - 1]\} = G[n - 1]$.

(a) Show that Theorem 8.8-3 holds for $G$ a Martingale with respect to $X$. Specifically,
substitute $G$ for $X$ in the statement of the theorem. Then make necessary
changes to the proof.

(b) Show that the Martingale convergence Theorem 8.8-4 holds for $G$ a Martingale
with respect to $X$.

8.57 Consider the hypothesis-testing problem involving $(n + 1)$ observations $X[0], \ldots, X[n]$
of the random sequence $X$. Define the likelihood ratio

$$L_X[n] \triangleq \frac{f_X(X[0], \ldots, X[n] \mid H_1)}{f_X(X[0], \ldots, X[n] \mid H_0)}, \quad n \ge 0,$$

corresponding to two hypotheses $H_1$ and $H_0$. Show that $L_X[n]$ is a Martingale with
respect to $X$ under hypothesis $H_0$.

8.58 In the discussion of interpolation in Example 8.4-7, work out the algebra needed to
arrive at the psd of the up-sampled random sequence $X_e[n]$.

8.59 The up-sampled sequence $X_e[n]$ in the interpolation process is clearly not WSS, even
if $X[n]$ is WSS. Create an up-sampled random sequence that is WSS by randomizing
the start-time of the sequence $X[n]$. That is, define a binary random variable $\Theta$
with $P[\Theta = 0] = P[\Theta = 1] = 0.5$. Define the start-time randomized sequence by
$X_r[n] \triangleq X[n + \Theta]$. Then the resulting up-sampled sequence is $X_{er}[n] = X\big[\frac{n + \Theta}{2}\big]$.
Show that $R_{X_r X_r}[k] = R_{XX}[k]$ and $R_{X_{er} X_{er}}[m, m + k] = R_{X_{er} X_{er}}[k] = 0.5 R_{XX}[k/2]$
for $k$ even, and zero for $k$ odd.
9 Random Processes


In the last chapter, we learned how to generalize the concept of random variable to that 
of random sequence. We did this by associating a sample sequence with each outcome 
$\zeta \in \Omega$, thereby generating a family of sequences collectively called a random sequence.
These sequences were indexed by a discrete (integer) parameter n in some index set Z. In this
chapter we generalize further by considering random functions of a continuous parameter. 
We consider this continuous parameter time, but it could equally well be position, or angle, 
or some other continuous parameter. The collection of all these continuous time functions is 
called a random process. Random processes will be perhaps the most useful objects we study 
because they can be used to model physical processes directly without any intervening need 
to sample the data. Even when of necessity one is dealing with sampled data, the concept 
of random process will give us the ability to reference the properties of the sample sequence 
to those of the limiting continuous process so as to be able to judge the adequacy of the 
sampling rate. 

Random processes find a wide variety of applications. Perhaps the most common use 
is as a model for noise in physical systems, modeling of the noise being the necessary first 
step in deciding on the best way to mitigate its negative effects. A second class of applica- 
tions concerns the modeling of random phenomena that are not noise but are nevertheless 
unknown to the system designer. An example would be a multimedia signal (audio, image, 
or video) on a communications link. The signal is not noise, but it is unknown from the 
viewpoint of a distant receiver and can take on many (an enormous number of) values. Thus, 
we model such signals as random processes, when some statistical description of the source 
is available. Situations such as this arise in other contexts also, such as control systems, 
pattern recognition, etc. Indeed from an information theory viewpoint, any waveform that
communicates information must have at least some degree of randomness in it. 

We start with a definition of random process and study some of the new difficulties to 
be encountered with continuous time. Then we look at the moment functions for random 
processes and generalize the correlation and covariance functions from Chapter 8 to this 
continuous parameter case. We also look at some basic random processes of practical impor- 
tance. We then begin a study of linear systems and random processes. Indeed, this topic is 
central to our study of random processes and is widely used in applications. Then we present 
some classifications of random processes based on general statistical properties. Finally, we 
introduce stationary and wide-sense stationary random processes and their analysis for 
linear systems. 


9.1 BASIC DEFINITIONS 


It is most important to fully understand the basic concept of the random process and its 
associated moment functions. The situation is analogous to the discrete-time case treated 
in Chapter 8. The main new difficulty is that the time axis has now become uncountable. 
We start with the basic definition. 


Definition 9.1-1 Let $(\Omega, \mathcal{F}, P)$ be a probability space. Then define a mapping X
from the sample space $\Omega$ to a space of continuous time functions. The elements in this
space will be called sample functions. This mapping is called a random process if at each
fixed time the mapping is a random variable, that is, $X(t, \zeta) \in \mathcal{F}$† for each fixed t on the
real line $-\infty < t < +\infty$. ∎


Thus we have a multidimensional function $X(t, \zeta)$, which for each fixed outcome $\zeta$
is an ordinary time function and for each fixed t is a random variable. This is shown
diagrammatically in Figure 9.1-1 for the special case where the sample space $\Omega$ is the
continuous interval [0,10]. We see a family of random variables indexed by t when we look
along the time axis, and we see a family of time functions indexed by $\zeta$ when we look along
the outcome “axis.”

We have the following elementary examples of random processes: 





Example 9.1-1
(simple process) $X(t, \zeta) = X(\zeta) f(t)$, where X is a random variable and f is a deterministic
function of the parameter t. We also write X(t) = X f(t).











Example 9.1-2
(random sinewave) $X(t, \zeta) = A(\zeta) \sin(\omega_0 t + \Theta(\zeta))$, where A and $\Theta$ are random variables.
We also write $X(t) = A \sin(\omega_0 t + \Theta)$, suppressing the outcome $\zeta$.


More typical examples of random processes can be constructed from random sequences. 


†$X \in \mathcal{F}$ is shorthand for $\{\zeta : X(\zeta) \le x\} \in \mathcal{F}$ for all x. This condition permits us to measure the
probability of events of this kind and hence define CDFs.
Figure 9.1-1 A random process for a continuous sample space $\Omega$ = [0,10].


Example 9.1-3
$X(t) = \sum_n X[n]\, p_n(t - T[n])$, where X[n] and T[n] are random sequences and the functions
$p_n(t)$ are deterministic waveforms that can take on various shapes. For example, the $p_n(t)$
might be ideal unit-step functions that could provide a model for a so-called jump process.
In this interpretation the T[n] would be the times of the arrivals and the X[n] would be the
amplitudes of the jumps. Then X(t) would indicate the total amplitude up to time t. If all
the X[n]’s were 1, we would have a counting process in that X(t) would count the arrivals
prior to time t.








If we sample the random process at n times $t_1$ through $t_n$, we get an n-dimensional
random vector. If we know the probability distribution of this vector for all times $t_1$ through
$t_n$ and for all positive n, then clearly we know a lot about the random process. If we know
all this information, we say that we have statistically specified (statistically determined)
the random process in a fashion that is analogous to the corresponding case for random
sequences.


Definition 9.1-2 A random process X(t) is statistically specified by its complete set
of nth-order CDFs (pdf’s or PMFs) for all positive integers n, that is,
$F_X(x_1, x_2, \ldots, x_n; t_1, t_2, \ldots, t_n)$ for all $x_1, x_2, \ldots, x_n$ and for all
$-\infty < t_1 < t_2 < \ldots < t_n < +\infty$. ∎


The term statistical comes from the fact that this is the limit of the information 
that could be obtained from accumulating relative frequencies of events determined by 
the random process X(t) at all finite collections of time instants. Clearly, this is all we 
could hope to determine by measurements on a process that we wish to model. However, 
the question arises: Is this enough information to completely determine the random process? 
Unfortunately the general answer is no. We need to impose a continuity requirement on the 
sample functions x(t). To see this the following simple example suffices.
Example 9.1-4
(from Karlin [9-1]) Let U be a uniform random variable on [0,1] and define the random
processes X(t) and Y(t) as follows:

$$X(t) \triangleq \begin{cases} 1 & \text{for } t = U \\ 0 & \text{else,} \end{cases}$$

and

$$Y(t) \triangleq 0 \quad \text{for all } t.$$

Then Y(t) and X(t) will have the same finite-order distributions, yet obviously the
probability of the following two events is not the same:

$$\{X(t) \le 0.5 \text{ for all } t\}$$

and

$$\{Y(t) \le 0.5 \text{ for all } t\}.$$


To show that Y(t) and X(t) have the same nth-order pdf’s, find the conditional nth-order 
pdf of X given U = u, then integrate out the conditioning on U. We leave this as an exercise 
to the reader. 


The problem in Example 9.1-4 is that the complementary event $\{X(t) > 0.5$ for
some $t \in [0,1]\}$ involves an uncountable number of random variables. Yet the statistical
determination and the extended additivity Axiom 4 (see Section 8.1) only allow us to
evaluate probabilities corresponding to countable numbers of random variables. In what
follows, we will generally assume that we always have a process “continuous enough” that
the family of finite-order distribution functions suffices to determine the process for all
time.† Such processes are called separable. The random process X(t) of the above example
is obviously not separable.

As in the case of random sequences, the moment functions play an important role in
practical applications. The mean function, denoted by $\mu_X(t)$, is given as

$$\mu_X(t) \triangleq E[X(t)], \quad -\infty < t < +\infty. \tag{9.1-1}$$

Similarly the correlation function is defined as the expected value of the conjugate product,

$$R_{XX}(t_1, t_2) \triangleq E[X(t_1) X^*(t_2)], \quad -\infty < t_1, t_2 < +\infty. \tag{9.1-2}$$

The covariance function is defined as the expected value of the conjugate product of the
centered process $X_c(t) \triangleq X(t) - \mu_X(t)$ at times $t_1$ and $t_2$:

$$K_{XX}(t_1, t_2) \triangleq E[X_c(t_1) X_c^*(t_2)] = E[(X(t_1) - \mu_X(t_1))(X(t_2) - \mu_X(t_2))^*]. \tag{9.1-3}$$


†An exception is white noise to be introduced in Section 9.3.
Clearly these three functions are not unrelated and in fact we have,

$$K_{XX}(t_1, t_2) = R_{XX}(t_1, t_2) - \mu_X(t_1)\mu_X^*(t_2). \tag{9.1-4}$$

We also define the variance function as $\sigma_X^2(t) \triangleq K_{XX}(t, t) = E[|X_c(t)|^2]$, and the power
function $R_{XX}(t, t) = E[|X(t)|^2]$.


Example 9.1-5 
(more on random sinewave) Consider the random process 


$$X(t) = A \sin(\omega_0 t + \Theta),$$

where A and $\Theta$ are independent, real-valued random variables and $\Theta$ is uniformly distributed
over $[-\pi, +\pi]$. For this sinusoidal random process, we will find the mean function $\mu_X(t)$ and
correlation function $R_{XX}(t_1, t_2)$. First

$$\mu_X(t) = E[A \sin(\omega_0 t + \Theta)]
= E[A]\,E[\sin(\omega_0 t + \Theta)]
= \mu_A \cdot \frac{1}{2\pi} \int_{-\pi}^{+\pi} \sin(\omega_0 t + \theta)\, d\theta
= \mu_A \cdot 0 = 0.$$

Then for the correlation,

$$R_{XX}(t_1, t_2) = E[X(t_1) X^*(t_2)]
= E[A^2 \sin(\omega_0 t_1 + \Theta) \sin(\omega_0 t_2 + \Theta)]
= E[A^2]\,E[\sin(\omega_0 t_1 + \Theta) \sin(\omega_0 t_2 + \Theta)].$$

Now, the second factor can be rewritten as

$$\tfrac{1}{2}\{E[\cos(\omega_0 (t_1 - t_2))] - E[\cos(\omega_0 (t_1 + t_2) + 2\Theta)]\} \tag{9.1-5}$$

by applying the trigonometric identity

$$\sin(B) \sin(C) = \tfrac{1}{2}\{\cos(B - C) - \cos(B + C)\},$$

and bringing the expectation operator inside. Then, since $\Theta$ is uniformly distributed over
$[-\pi, +\pi]$, the integral arising from the second expectation in Equation 9.1-5 is zero, and we
finally obtain

$$R_{XX}(t_1, t_2) = \tfrac{1}{2} E[A^2] \cos \omega_0 (t_1 - t_2).$$

We note that $\mu_X(t) = 0$ (a constant) and $R_{XX}(t_1, t_2)$ depends only on $t_1 - t_2$. Such processes
will be classified as wide-sense stationary in Section 9.4.
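
The result of Example 9.1-5 can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original example: the amplitude law A ~ U(0, 2) (so E[A^2] = 4/3), the value of w0, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def sinewave_moments(t1, t2, w0=2.0 * math.pi, trials=100_000, seed=0):
    """Monte Carlo estimates of mu_X(t1) and R_XX(t1, t2) for X(t) = A sin(w0 t + Theta)."""
    rng = random.Random(seed)
    mean_acc = 0.0
    corr_acc = 0.0
    for _ in range(trials):
        a = rng.uniform(0.0, 2.0)               # assumed amplitude law, E[A^2] = 4/3
        theta = rng.uniform(-math.pi, math.pi)  # Theta uniform on [-pi, +pi]
        x1 = a * math.sin(w0 * t1 + theta)
        x2 = a * math.sin(w0 * t2 + theta)
        mean_acc += x1
        corr_acc += x1 * x2
    return mean_acc / trials, corr_acc / trials


def sinewave_correlation_theory(t1, t2, w0=2.0 * math.pi, second_moment=4.0 / 3.0):
    """Closed form 0.5 E[A^2] cos(w0 (t1 - t2))."""
    return 0.5 * second_moment * math.cos(w0 * (t1 - t2))


if __name__ == "__main__":
    est_mean, est_corr = sinewave_moments(0.3, 0.1)
    print(est_mean, est_corr, sinewave_correlation_theory(0.3, 0.1))
```

The estimated mean should be close to zero and the estimated correlation close to the closed form, for any pair of times with the same difference.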
As in the discrete-time case, the correlation and covariance functions are Hermitian
symmetric, that is,

$$R_{XX}(t_1, t_2) = R_{XX}^*(t_2, t_1),$$
$$K_{XX}(t_1, t_2) = K_{XX}^*(t_2, t_1),$$


which directly follow from the linearity of the expectation operator E. 

If we sample the random process at N times $t_1, t_2, \ldots, t_N$, we form a random vector.
We have already seen that the correlation or covariance matrix of a random vector must 
be positive semidefinite (cf. Chapter 5). This, then, imposes certain requirements on the 
respective correlation and covariance function of the random process. Specifically, every 
correlation (covariance) matrix that can be formed from a correlation (covariance) function 
must be positive semidefinite. We next define positive semidefinite functions. 


Definition 9.1-3 The two-dimensional function g(t, s) is positive semidefinite if for
all N > 0, and all $t_1 < t_2 < \ldots < t_N$, and for all complex constants $a_1, a_2, \ldots, a_N$, we have

$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j^* g(t_i, t_j) \ge 0. \; ∎$$


Using this definition, we can thus say that all correlation and covariance functions must 
be positive semidefinite. Later we will see that this necessary condition is also sufficient. 
Although positive semidefiniteness is an important constraint, it is difficult to apply this 
condition in a test of the legitimacy of a proposed correlation function. 

Another fundamental property of correlation and covariance functions is diagonal
dominance,

$$|R_{XX}(t, s)| \le \sqrt{R_{XX}(t, t) R_{XX}(s, s)} \quad \text{for all } t, s,$$

which follows from the Cauchy-Schwarz inequality (cf. Equation 4.3-17). Diagonal
dominance is implied by positive semidefiniteness but is a much weaker condition.
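
To illustrate how much weaker diagonal dominance is, the sketch below evaluates the quadratic form of Definition 9.1-3 directly. It is an illustration under stated assumptions: `rect_candidate` is a hypothetical test function (not from the text), and the randomized check can only reveal violations, it cannot prove positive semidefiniteness.

```python
import math
import random


def quadratic_form(g, times, coeffs):
    """Return sum_i sum_j a_i * conj(a_j) * g(t_i, t_j)."""
    total = 0j
    for ti, ai in zip(times, coeffs):
        for tj, aj in zip(times, coeffs):
            total += ai * aj.conjugate() * g(ti, tj)
    return total


def sinewave_corr(t, s, w0=2.0 * math.pi, second_moment=1.0):
    """R_XX(t, s) = 0.5 E[A^2] cos(w0 (t - s)), the random-sinewave correlation."""
    return 0.5 * second_moment * math.cos(w0 * (t - s))


def rect_candidate(t, s):
    """Hypothetical candidate: 1 when |t - s| < 1, else 0."""
    return 1.0 if abs(t - s) < 1.0 else 0.0


def passes_random_psd_checks(g, n_points=6, n_trials=200, seed=0, tol=1e-9):
    """Try random times and complex coefficients; False if any quadratic form is negative."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        times = sorted(rng.uniform(0.0, 5.0) for _ in range(n_points))
        coeffs = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_points)]
        if quadratic_form(g, times, coeffs).real < -tol:
            return False
    return True


def is_diagonally_dominant(g, t, s, tol=1e-12):
    """Check |g(t, s)| <= sqrt(g(t, t) g(s, s))."""
    return abs(g(t, s)) <= math.sqrt(g(t, t) * g(s, s)) + tol


if __name__ == "__main__":
    print(passes_random_psd_checks(sinewave_corr))
    # Times 0, 0.9, 1.8 with coefficients (1, -sqrt(2), 1) give a negative quadratic form.
    print(quadratic_form(rect_candidate, [0.0, 0.9, 1.8], [1.0, -math.sqrt(2.0), 1.0]).real)
```

The rectangular candidate satisfies diagonal dominance everywhere, yet the specific choice of times and coefficients in the demo yields a negative quadratic form, so it cannot be a correlation function.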


9.2 SOME IMPORTANT RANDOM PROCESSES 


In this section we introduce several important random processes. We start with the asyn- 
chronous binary signaling (ABS) process and the random telegraph signal (RTS). We 
continue with the Poisson counting process; the phase-shift keying (PSK) random process, 
an example of digital modulation; the Wiener process, which is obtained as a contin- 
uous limit of a random walk sequence; and lastly introduce the broad class of Markov 
processes. 


Asynchronous Binary Signaling 


A sample function of the asynchronous binary signaling (ABS) process (important for digital 
modulation and computers) is shown in Figure 9.2-1. Each pulse has width T with the 





Figure 9.2-1 Sample function realization of the asynchronous binary signaling (ABS) process. (Plotted 
for D=0.) 


random variable $X_n$ indicating the height of the nth pulse, taking on values $\pm a$ with equal
probability.

The sequence is asynchronous because the start time of the nth pulse or, equivalently,
the displacement D of the 0th pulse is a uniform random variable $U(-\frac{T}{2}, \frac{T}{2})$. For $|t_2 - t_1| < T$,
the sampling instant $t_2$ could be on the same pulse containing the sampling instant $t_1$
or on a different pulse.

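The ABS process can thus be described mathematically by

$$X(t) = \sum_n X_n\, w\!\left(\frac{t - D - nT}{T}\right),$$

where the pulse (rectangular window) function w(t) is defined as

$$w(t) \triangleq \begin{cases} 1 & \text{for } |t| < \tfrac{1}{2} \\ 0 & \text{else.} \end{cases}$$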


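The correlation function for this real-valued process is given as

$$R_{XX}(t_1, t_2) = E[X(t_1) X(t_2)]
= E\left[\sum_n \sum_l X_n X_l\, w\!\left(\frac{t_1 - D - nT}{T}\right) w\!\left(\frac{t_2 - D - lT}{T}\right)\right].$$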


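In the ABS process it is assumed the levels of different pulses are independent random
variables and that these, in turn, are independent of the random displacement D. Since
$E[X_n X_l] = E[X_n] E[X_l]$ for $n \ne l$ and $E[X_n^2] = a^2$, we obtain

$$R_{XX}(t_1, t_2) = a^2 \sum_n E\left[w\!\left(\frac{t_1 - D - nT}{T}\right) w\!\left(\frac{t_2 - D - nT}{T}\right)\right]
+ \sum_{n \ne l} E[X_n] E[X_l]\, E\left[w\!\left(\frac{t_1 - D - nT}{T}\right) w\!\left(\frac{t_2 - D - lT}{T}\right)\right].$$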
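Now, the second term on the right, the one involving the $n \ne l$ products, is zero because
$E[X_n] = E[X_l] = 0$. Also, averaging over the uniform displacement D,

$$E\left[\sum_n w\!\left(\frac{t_1 - D - nT}{T}\right) w\!\left(\frac{t_2 - D - nT}{T}\right)\right]
= \frac{1}{T} \int_{-T/2}^{T/2} \sum_n w\!\left(\frac{t_1 - \delta - nT}{T}\right) w\!\left(\frac{t_2 - \delta - nT}{T}\right) d\delta
= \left(1 - \frac{t_2 - t_1}{T}\right) w\!\left(\frac{t_2 - t_1}{2T}\right) \quad \text{for } t_2 > t_1.$$

More generally, and for $\tau \triangleq t_2 - t_1 \gtrless 0$, we can write that

$$R_{XX}(\tau) = a^2 \left(1 - \frac{|\tau|}{T}\right) w\!\left(\frac{\tau}{2T}\right) \tag{9.2-1}$$

since $w(|\tau|) = w(\tau)$.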


Equation 9.2-1 is directly extended to the case of equiprobable transitions between two
arbitrary levels, say a and b. The required modification is

$$R_{XX}(\tau) = \frac{1}{4}(a - b)^2 \left(1 - \frac{|\tau|}{T}\right) w\!\left(\frac{\tau}{2T}\right) + \left(\frac{a + b}{2}\right)^2.$$

We leave the derivation of this result as an exercise for the reader. In Figure 9.2-2 we show
the ABS correlation function $R_{XX}(\tau)$ for a = 1, b = 0, and T = 1.
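
Equation 9.2-1 can be checked by simulation. The sketch below is illustrative only: it assumes levels of ±a, a displacement D uniform on (-T/2, T/2), and a rectangular pulse centered on D + nT; the function names are hypothetical.

```python
import math
import random


def abs_correlation_mc(tau, a=1.0, T=1.0, trials=100_000, seed=0):
    """Monte Carlo estimate of R_XX(tau) = E[X(t1) X(t1 + tau)] for the ABS process."""
    rng = random.Random(seed)
    t1 = 0.0
    t2 = t1 + tau
    acc = 0.0
    for _ in range(trials):
        d = rng.uniform(-T / 2.0, T / 2.0)        # random displacement of the 0th pulse
        n1 = math.floor((t1 - d) / T + 0.5)       # index of the pulse containing t1
        n2 = math.floor((t2 - d) / T + 0.5)       # index of the pulse containing t2
        x1 = rng.choice((-a, a))
        # Same pulse: same level. Different pulses: independent level.
        x2 = x1 if n1 == n2 else rng.choice((-a, a))
        acc += x1 * x2
    return acc / trials


def abs_correlation_theory(tau, a=1.0, T=1.0):
    """Triangular correlation a^2 (1 - |tau|/T) for |tau| < T, zero otherwise."""
    return a * a * (1.0 - abs(tau) / T) if abs(tau) < T else 0.0


if __name__ == "__main__":
    for tau in (0.0, 0.3, 0.8, 1.5):
        print(tau, abs_correlation_mc(tau), abs_correlation_theory(tau))
```

The estimate falls linearly from a^2 at tau = 0 to zero at |tau| = T, matching the triangle in Equation 9.2-1.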





Poisson Counting Process 


Let the process N(t) represent the total number of counts (arrivals) up to time t. Then we
can write

$$N(t) \triangleq \sum_{n=1}^{\infty} u(t - T[n]),$$


Figure 9.2-2 Autocorrelation function of ABS random process for a = 1, b = 0 and T = 1.

Figure 9.2-3 A sample function of the Poisson process running on $[0, \infty)$.


where u(t) is the unit-step function and T[n], the time to the nth arrival, is the random
sequence of times considered in Example 8.1-11. There we showed that the T[n] obeyed the
nonstationary first-order Erlang density,

$$f_T(t; n) = \frac{(\lambda t)^{n-1}}{(n-1)!} \lambda e^{-\lambda t} u(t), \quad n > 0, \tag{9.2-2}$$

which was obtained as an n-fold convolution of exponential pdf’s. A typical sample function
is shown in Figure 9.2-3, where $T[n] = t_n$ and $\tau[n] = \tau_n$. Note that the times between the
arrivals,

$$\tau[n] \triangleq T[n] - T[n-1],$$

the interarrival times, are jointly independent and identically distributed, having the
exponential pdf,

$$f_\tau(t) = \lambda e^{-\lambda t} u(t),$$

as in Example 8.1-11. Thus, T[n] denotes the total time until the nth arrival if we begin
counting at the reference time t = 0.


Now by the construction involving the unit-step function, the value N(t) is the number
of arrivals up to and including time t, so

$$P[N(t) = n] = P[T[n] \le t,\; T[n+1] > t],$$

because the only way that N(t) can equal n is if the random variable T[n] is less than or
equal to t and the random variable T[n+1] is greater than t. If we bring in the independent
interarrival times, we can re-express this probability as

$$P[T[n] \le t,\; \tau[n+1] > t - T[n]],$$
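which can be easily calculated using the statistical independence of the arrival time T[n]
and the interarrival time $\tau[n+1]$ as follows:

$$\int_0^t f_T(\alpha; n) \left[\int_{t-\alpha}^{\infty} f_\tau(\beta)\, d\beta\right] d\alpha
= \int_0^t \frac{(\lambda\alpha)^{n-1}}{(n-1)!} \lambda e^{-\lambda\alpha} \left(\int_{t-\alpha}^{\infty} \lambda e^{-\lambda\beta}\, d\beta\right) d\alpha \cdot u(t)
= \left(\int_0^t \alpha^{n-1}\, d\alpha\right) \frac{\lambda^n e^{-\lambda t}}{(n-1)!}\, u(t),$$

or, with $P_N(n; t) \triangleq P\{N(t) = n\}$,

$$P_N(n; t) = \frac{(\lambda t)^n}{n!} e^{-\lambda t} u(t) \quad \text{for } t \ge 0,\; n \ge 0. \tag{9.2-3}$$

We have thus arrived at the PMF of the Poisson counting process and we note that it is
equal to that of a Poisson random variable (cf. Equation 2.5-13, see also Equation 1.10-5)
with mean $\mu = \lambda t$, that is,

$$E[N(t)] = \lambda t. \tag{9.2-4}$$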


We call À the mean arrival rate (also sometimes called intensity). It is intuitively satisfying 
that the average value of the process at time t is the mean arrival rate \ multiplied by the 
length of the time interval (0,¢]. We leave it as an exercise for the reader to consider why 
this is so. 
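
The construction of N(t) from exponential interarrival times lends itself to a quick simulation. The following is a minimal sketch, with hypothetical function names and parameter values, that compares empirical counts against Equations 9.2-3 and 9.2-4.

```python
import math
import random


def poisson_count(t, lam, rng):
    """N(t): number of arrivals in (0, t], built from i.i.d. exponential interarrival times."""
    count = 0
    arrival = rng.expovariate(lam)       # T[1]
    while arrival <= t:
        count += 1
        arrival += rng.expovariate(lam)  # T[n+1] = T[n] + tau[n+1]
    return count


def poisson_pmf(n, mean):
    """PMF of Equation 9.2-3 with mean = lambda * t."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


def empirical_counts(t, lam, trials=50_000, seed=0):
    """Independent realizations of N(t)."""
    rng = random.Random(seed)
    return [poisson_count(t, lam, rng) for _ in range(trials)]


if __name__ == "__main__":
    lam, t = 2.0, 1.5
    counts = empirical_counts(t, lam)
    print("sample mean:", sum(counts) / len(counts), "theory:", lam * t)
    for n in range(6):
        print(n, counts.count(n) / len(counts), poisson_pmf(n, lam * t))
```

The sample mean should approach λt, and the relative frequencies should approach the Poisson PMF.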

Since the random sequence T[n] has independent increments (cf. Definition 8.1-4) and
the unit-step function used in the definition of the Poisson process is causal, it seems
reasonable that the Poisson process N(t) would also have independent increments. However, this
result is not clear because one of the jointly independent interarrival times $\tau[n]$ may be
partially in two disjoint intervals, hence causing a dependency in neighboring increments.
Nevertheless, using the memoryless property of the exponential pdf (see Problem 9.8), one
can show that the independent-increments property does hold for the Poisson process.

Using independent increments we can evaluate the PMF of the increment in the Poisson
counting process over an interval $(t_a, t_b]$ as

$$P[N(t_b) - N(t_a) = n] = \frac{[\lambda(t_b - t_a)]^n}{n!} e^{-\lambda(t_b - t_a)} u(n), \tag{9.2-5}$$

where we have used the fact that the interarrival sequence is stationary, that is, that $\lambda$ is a
constant. We formalize this somewhat in the following definition.


Definition 9.2-1 A random process has independent increments when the set of n
random variables,

$$X(t_1),\; X(t_2) - X(t_1),\; \ldots,\; X(t_n) - X(t_{n-1}),$$

are jointly independent for all $t_1 < t_2 < \ldots < t_n$ and for all $n \ge 1$. ∎


This just says that the increments are statistically independent when the corresponding 
intervals do not overlap. Just as in the random sequence case, the independent-increment 
property makes it easy to get the higher-order distributions. For example, in the case at
hand, the Poisson counting process, we can write for $t_2 > t_1$,

$$P_N(n_1, n_2; t_1, t_2) = P[N(t_1) = n_1]\, P[N(t_2) - N(t_1) = n_2 - n_1]
= \frac{(\lambda t_1)^{n_1}}{n_1!} e^{-\lambda t_1}\, \frac{[\lambda(t_2 - t_1)]^{n_2 - n_1}}{(n_2 - n_1)!} e^{-\lambda(t_2 - t_1)}\, u(n_1) u(n_2 - n_1),$$

which simplifies to

$$P_N(n_1, n_2; t_1, t_2) = \frac{\lambda^{n_2} t_1^{n_1} (t_2 - t_1)^{n_2 - n_1}}{n_1! (n_2 - n_1)!} e^{-\lambda t_2}\, u(n_1) u(n_2 - n_1), \quad 0 \le t_1 < t_2.$$

See also Problem 1.54. Using the independent-increments property we can formulate the
following alternative definition of a Poisson counting process.


Definition 9.2-2 A Poisson counting process is the independent-increments process
whose increments are Poisson distributed as in Equation 9.2-5. ∎


Concerning the moment functions of the Poisson process, the first-order moment has
been shown to be $\lambda t$. This is the mean function of the process. Letting $t_2 \ge t_1$, we can
calculate the correlation function using the independent-increments property as

$$E[N(t_2) N(t_1)] = E[(N(t_1) + [N(t_2) - N(t_1)]) N(t_1)]$$
$$= E[N^2(t_1)] + E[N(t_2) - N(t_1)]\, E[N(t_1)]$$
$$= \lambda t_1 + \lambda^2 t_1^2 + \lambda(t_2 - t_1) \lambda t_1$$
$$= \lambda t_1 + \lambda^2 t_1 t_2.$$

If $t_2 < t_1$, we merely interchange $t_1$ and $t_2$ in the preceding formula. Thus the general result
for all $t_1$ and $t_2$ is

$$R_{NN}(t_1, t_2) = E[N(t_1) N(t_2)] = \lambda \min(t_1, t_2) + \lambda^2 t_1 t_2. \tag{9.2-6}$$

If we evaluate the covariance using Equations 9.2-4 and 9.2-6 we obtain

$$K_{NN}(t_1, t_2) = \lambda \min(t_1, t_2). \tag{9.2-7}$$


We thus see that the variance of the process is equal to At and is the same as its mean, 
a property inherited from the Poisson random variable. Also we see that the covariance 
depends only on the earlier of the two times involved. The reason for this is seen by writing 
N (t) as the value at an earlier time plus an increment, and then noting that the independence 
of this increment and N (t) at the earlier time implies that the covariance between them must 
be zero. Thus, the covariance of this independent-increments process is just the variance of 
the process at the earlier of the two times. 
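
A short simulation can confirm that the covariance depends only on the earlier time. This is a sketch under the assumption that sample paths are generated from exponential interarrival times; the function names and parameter values are illustrative.

```python
import random


def counts_at(times, lam, rng):
    """Return [N(t) for t in times] from one sample path; times must be ascending."""
    counts = []
    count = 0
    arrival = rng.expovariate(lam)
    for t in times:
        while arrival <= t:
            count += 1
            arrival += rng.expovariate(lam)
        counts.append(count)
    return counts


def poisson_covariance_mc(t1, t2, lam, trials=50_000, seed=0):
    """Monte Carlo estimate of K_NN(t1, t2) = E[N(t1) N(t2)] - E[N(t1)] E[N(t2)]."""
    rng = random.Random(seed)
    lo, hi = sorted((t1, t2))
    s_lo = s_hi = s_prod = 0
    for _ in range(trials):
        n_lo, n_hi = counts_at([lo, hi], lam, rng)
        s_lo += n_lo
        s_hi += n_hi
        s_prod += n_lo * n_hi
    return s_prod / trials - (s_lo / trials) * (s_hi / trials)


if __name__ == "__main__":
    lam = 2.0
    for t1, t2 in ((1.0, 3.0), (1.0, 5.0), (3.0, 1.0)):
        print(t1, t2, poisson_covariance_mc(t1, t2, lam), lam * min(t1, t2))
```

Moving the later time from 3 to 5 leaves the estimate essentially unchanged, as Equation 9.2-7 predicts.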


Example 9.2-1
(radioactivity monitor) In radioactivity monitoring, the particle-counting process can often
be adequately modeled as Poisson. Let the counter start to monitor at some arbitrary time t
and then count for $T_0$ seconds. If the count is above a threshold, say $N_0$, an alarm will be
sounded. Assuming the arrival rate to be $\lambda$, we want to know the probability that the alarm
will not sound when radioactive material is present.

Since the process is Poisson, we know it has independent increments that satisfy the
Poisson distribution. Thus the count $\Delta N$ in the interval $(t, t + T_0]$, that is,
$\Delta N \triangleq N(t + T_0) - N(t)$, is Poisson distributed with mean $\lambda T_0$ independent of t. The probability
of $N_0$ or fewer counts is thus

$$P[\Delta N \le N_0] = \sum_{k=0}^{N_0} \frac{(\lambda T_0)^k}{k!} e^{-\lambda T_0}.$$

If $N_0$ is small we can calculate the sum directly. If $\lambda T_0 \gg 1$, we can use the Gaussian
approximation (Equation 1.11-9) to the Poisson distribution.

Example 9.2-2
(sum of two independent Poisson processes) Let $N_1(t)$ be a Poisson counting process with
rate $\lambda_1$. Let $N_2(t)$ be a second Poisson counting process with rate $\lambda_2$, where $N_2$ is
independent of $N_1$. The sum of the two processes, $N(t) \triangleq N_1(t) + N_2(t)$, could model the
total number of failures of two separate machines, whose failure rates are $\lambda_1$ and $\lambda_2$,
respectively. It is a remarkable fact that N(t) is also a Poisson counting process with rate
$\lambda = \lambda_1 + \lambda_2$.

To see this we use Definition 9.2-2 of the Poisson counting process and verify these
conditions for N(t). First, it is clear with a little reflection that the sum of two
independent-increments processes will also be an independent-increments process if the processes are
jointly independent. Second, for any increment $N(t_b) - N(t_a)$ with $t_b > t_a$, we can
write

$$N(t_b) - N(t_a) = N_1(t_b) - N_1(t_a) + N_2(t_b) - N_2(t_a).$$

Thus the increment in N is the sum of two corresponding increments in $N_1$ and $N_2$. The
desired result then follows from the fact that the sum of two independent Poisson random
variables is also Poisson distributed with parameter equal to the sum of the two parameters
(cf. Example 3.3-8). Thus the parameter of the increment in N(t) is

$$\lambda_1 (t_b - t_a) + \lambda_2 (t_b - t_a) = (\lambda_1 + \lambda_2)(t_b - t_a)$$

as desired.
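
As a numerical sanity check of Example 9.2-2, the sketch below adds the counts of two independently simulated processes and compares the sample mean and variance with $(\lambda_1 + \lambda_2)t$. Matching mean and variance is consistent with a Poisson law but does not by itself prove it; the rates and function names are assumptions.

```python
import random


def poisson_count(t, lam, rng):
    """Number of arrivals in (0, t] from i.i.d. exponential interarrival times."""
    count = 0
    arrival = rng.expovariate(lam)
    while arrival <= t:
        count += 1
        arrival += rng.expovariate(lam)
    return count


def summed_counts(t, lam1, lam2, trials=50_000, seed=0):
    """Realizations of N(t) = N1(t) + N2(t) for independent N1 and N2."""
    rng = random.Random(seed)
    return [poisson_count(t, lam1, rng) + poisson_count(t, lam2, rng)
            for _ in range(trials)]


def sample_mean_and_variance(values):
    """Plain sample mean and (biased) sample variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


if __name__ == "__main__":
    mean, var = sample_mean_and_variance(summed_counts(1.0, 2.0, 3.0))
    print(mean, var)  # both should be near (2 + 3) * 1 = 5
```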





The Poisson counting process N(t) can be generalized in several ways. We can let the
arrival rate be a function of time. The arrival rate $\lambda(t)$ must satisfy $\lambda(t) \ge 0$. The average
value of the resulting nonuniform Poisson counting process then becomes

$$\mu_X(t) = \int_0^t \lambda(\tau)\, d\tau, \quad t \ge 0. \tag{9.2-8}$$


The increments then become independent Poisson distributed with increment means deter- 
mined by this time-varying mean function. Another possible generalization is to two-dimensional 
or spatial Poisson processes that are used to model photon arrival at an image sensor, defects 

on semiconductor wafers, etc. 
Alternative Derivation of Poisson Process


It may be interesting to rederive the Poisson counting process from the elementary properties 
of random points in time listed in Chapter 1, Section 1.10. They are repeated here in a 
notation consistent with that used in this chapter. For $\Delta t$ small:

(1) $P_N(1; t, t + \Delta t) = \lambda(t) \Delta t + o(\Delta t)$.
(2) $P_N(k; t, t + \Delta t) = o(\Delta t)$, k > 1.
(3) Events in nonoverlapping time intervals are statistically independent.

Here the notation $o(\Delta t)$, read “little oh,” denotes any quantity that goes to zero at a
faster than linear rate in such a way that

$$\lim_{\Delta t \to 0} \frac{o(\Delta t)}{\Delta t} = 0,$$

and $P_N(k; t, t + \Delta t) \triangleq P[N(t + \Delta t) - N(t) = k]$.

We note that property (3) is just the independent-increments property for the counting
process N(t) which counts the number of events occurring in (0, t].

We can compute the probability $P_N(k; t, t + \tau)$ of k events in $(t, t + \tau)$ as follows. Consider
$P_N(k; t, t + \tau + \Delta t)$; if $\Delta t$ is very small, then in view of properties (1) and (2) there are only
the following two possibilities for getting k events in $(t, t + \tau + \Delta t)$:

$E_1 = \{k$ in $(t, t + \tau)$ and 0 in $(t + \tau, t + \tau + \Delta t)\}$ or
$E_2 = \{k - 1$ in $(t, t + \tau)$ and 1 in $(t + \tau, t + \tau + \Delta t)\}$.

Since events $E_1$ and $E_2$ are disjoint events, their probabilities add and we can write

$$P_N(k; t, t + \tau + \Delta t) = P_N(k; t, t + \tau)\, P_N(0; t + \tau, t + \tau + \Delta t)
+ P_N(k - 1; t, t + \tau)\, P_N(1; t + \tau, t + \tau + \Delta t)$$
$$= P_N(k; t, t + \tau)[1 - \lambda(t + \tau) \Delta t]
+ P_N(k - 1; t, t + \tau)\, \lambda(t + \tau) \Delta t.$$

If we rearrange terms, divide by $\Delta t$, and take limits, we obtain the linear differential
equations (LDEs),

$$\frac{dP_N(k; t, t + \tau)}{d\tau} = \lambda(t + \tau)[P_N(k - 1; t, t + \tau) - P_N(k; t, t + \tau)].$$

Thus, we obtain a set of recursive first-order differential equations from which we can solve
for $P_N(k; t, t + \tau)$, k = 0, 1, .... We set $P_N(-1; t, t + \tau) = 0$, since this is the probability of the
impossible event. Also, to shorten our notation, we temporarily write $P_N(k) \triangleq P_N(k; t, t + \tau)$;
thus the dependences on t and $\tau$ are submerged but of course are still there.

When k = 0,

$$\frac{dP_N(0)}{d\tau} = -\lambda(t + \tau) P_N(0).$$
This is a simple first-order, homogeneous differential equation for which the solution is

$$P_N(0) = C \exp\left[-\int_t^{t+\tau} \lambda(\xi)\, d\xi\right].$$

Since $P_N(0; t, t) = 1$, C = 1 and

$$P_N(0) = \exp\left[-\int_t^{t+\tau} \lambda(\xi)\, d\xi\right].$$

Let us define $\mu$ by

$$\mu \triangleq \int_t^{t+\tau} \lambda(\xi)\, d\xi.$$

Then

$$P_N(0) = e^{-\mu}.$$

When k = 1, the differential equation is now

$$\frac{dP_N(1)}{d\tau} + \lambda(t + \tau) P_N(1) = \lambda(t + \tau) P_N(0) = \lambda(t + \tau) e^{-\mu}. \tag{9.2-9}$$

This elementary first-order, inhomogeneous equation has a solution that is the sum of the
homogeneous and particular solutions. For the homogeneous solution, $P_h$, we already know
from the k = 0 case that

$$P_h = C_2 e^{-\mu}.$$

For the particular solution $P_p$ we use the method of variation of parameters to assume that

$$P_p = v(t + \tau) e^{-\mu},$$

where $v(t + \tau)$ is to be determined. By substituting this equation into Equation 9.2-9 we
readily find that

$$P_p = \mu e^{-\mu}.$$

The complete solution is $P_N(1) = P_h + P_p$. Since $P_N(1; t, t) = 0$, we obtain $C_2 = 0$ and thus

$$P_N(1) = \mu e^{-\mu}.$$


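General case. The LDE in the general case is

$$\frac{dP_N(k)}{d\tau} + \lambda(t + \tau) P_N(k) = \lambda(t + \tau) P_N(k - 1)$$

and, proceeding by induction, we find that

$$P_N(k) = \frac{\mu^k}{k!} e^{-\mu}, \quad k = 0, 1, \ldots$$

which is the key result. Recalling the definition of $\mu$, we can write

$$P_N(k; t, t + \tau) = \frac{1}{k!} \left[\int_t^{t+\tau} \lambda(\xi)\, d\xi\right]^k \exp\left[-\int_t^{t+\tau} \lambda(\xi)\, d\xi\right]. \tag{9.2-10}$$

We thus obtain the nonuniform Poisson counting process.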
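
The recursive differential equations can also be integrated numerically and compared with Equation 9.2-10. The sketch below uses a classical fourth-order Runge-Kutta step; the rate function $\lambda(t) = 1 + 0.5 \sin t$, the step count, and the function names are assumptions made only for this illustration.

```python
import math


def rate(time):
    """Assumed time-varying arrival rate lambda(t) = 1 + 0.5 sin(t) (always positive)."""
    return 1.0 + 0.5 * math.sin(time)


def integrated_rate(t, tau):
    """mu = integral of rate from t to t + tau, in closed form for the assumed rate."""
    return tau + 0.5 * (math.cos(t) - math.cos(t + tau))


def solve_count_probabilities(t, tau, k_max, steps=2000):
    """Integrate dP(k)/ds = lambda(t + s) [P(k-1) - P(k)] from s = 0 to s = tau."""
    def deriv(s, p):
        lam = rate(t + s)
        return [lam * ((p[k - 1] if k > 0 else 0.0) - p[k]) for k in range(k_max + 1)]

    p = [1.0] + [0.0] * k_max     # P(0) = 1 and P(k) = 0 for k > 0 at s = 0
    h = tau / steps
    s = 0.0
    for _ in range(steps):
        k1 = deriv(s, p)
        k2 = deriv(s + h / 2, [pi + h / 2 * d for pi, d in zip(p, k1)])
        k3 = deriv(s + h / 2, [pi + h / 2 * d for pi, d in zip(p, k2)])
        k4 = deriv(s + h, [pi + h * d for pi, d in zip(p, k3)])
        p = [pi + h / 6 * (a + 2 * b + 2 * c + d)
             for pi, a, b, c, d in zip(p, k1, k2, k3, k4)]
        s += h
    return p


def closed_form(t, tau, k):
    """Equation 9.2-10: mu^k / k! * exp(-mu)."""
    mu = integrated_rate(t, tau)
    return mu ** k / math.factorial(k) * math.exp(-mu)


if __name__ == "__main__":
    numeric = solve_count_probabilities(0.7, 2.0, 5)
    for k, value in enumerate(numeric):
        print(k, value, closed_form(0.7, 2.0, k))
```

Because each equation depends only on the previous one, truncating the system at `k_max` does not affect the accuracy of the lower-order probabilities.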
Figure 9.2-4 Sample function of the random telegraph signal.


Another way to generalize the Poisson process is to use a different pdf for the indepen- 
dent interarrival times. With a nonexponential density, the more general process is called a 
renewal process [9-2]. The word “renewal” can be related to the interpretation of the arrival 
times as the failure times of certain equipment; thus the value of the counting process N(t) 
models the number of renewals that have had to be made up to the present time. 


Random Telegraph Signal 


When all the information in a random waveform is contained in the zero crossings, a so- 
called “hard clipper” is often used to generate a simpler yet equivalent two-level waveform 
that is free of unwanted random amplitude variation. A special case is when the number of 
zero crossings in a time interval follows the Poisson law, and the resulting random process 
is called the random telegraph signal (RTS). A sample function of the RTS is shown in 
Figure 9.2-4. 

We construct the RTS on $t \ge 0$ as follows: Let $X(0) = \pm a$ with equal probability.
Then take the Poisson arrival time sequence T[n] of Chapter 8 and use it to switch the
level of the RTS; that is, at T[1] switch the sign of X(t), and then at T[2], and so forth.
Clearly from the symmetry and the fact that the interarrival times $\tau[n]$ are stationary and
form an independent random sequence, we must have that $\mu_X(t) = 0$ and that the first-order
PMF $P_X(a) = P_X(-a) = 1/2$. Next let $t_2 > t_1 > 0$, and consider the second-order
PMF $P_X(x_1, x_2) \triangleq P[X(t_1) = x_1, X(t_2) = x_2]$ along with
$P_X(x_2 \mid x_1) \triangleq P[X(t_2) = x_2 \mid X(t_1) = x_1]$. Then we can write the correlation function as

$$R_{XX}(t_1, t_2) = E[X(t_1) X(t_2)]$$
$$= a^2 P_X(a, a) + (-a)^2 P_X(-a, -a) + a(-a) P_X(a, -a) + (-a)(a) P_X(-a, a)$$
$$= \tfrac{1}{2} a^2 \left(P_X(a \mid a) + P_X(-a \mid -a) - P_X(-a \mid a) - P_X(a \mid -a)\right),$$

since $P_X(a) = P_X(-a) = 1/2$. But $P_X(-a \mid -a) = P_X(a \mid a)$ is just the probability of an
even number of zero crossings in the time interval $(t_1, t_2]$, while $P_X(-a \mid a) = P_X(a \mid -a)$ is
the probability of an odd number of crossings of 0. Hence, writing the average number of
transitions per unit time as $\lambda$, and substituting $\tau \triangleq t_2 - t_1$, we get
$$R_{XX}(t_1, t_2) = a^2 \left[\sum_{\text{even } k \ge 0} e^{-\lambda\tau} \frac{(\lambda\tau)^k}{k!} - \sum_{\text{odd } k \ge 0} e^{-\lambda\tau} \frac{(\lambda\tau)^k}{k!}\right]
= a^2 e^{-\lambda\tau} \sum_{\text{all } k \ge 0} (-1)^k \frac{(\lambda\tau)^k}{k!},$$

where we have combined the two sums by making use of the function $(-1)^k$, since $(-1)^k = 1$
for k even and $(-1)^k = -1$ for k odd. Thus we now have

$$R_{XX}(t_1, t_2) = a^2 e^{-\lambda\tau} \sum_{\text{all } k \ge 0} \frac{(-\lambda\tau)^k}{k!} = a^2 e^{-2\lambda\tau}$$

for the case when $\tau > 0$. Since the correlation function of a real-valued process must be
symmetric, we have $R_{XX}(t_1, t_2) = R_{XX}(t_2, t_1)$, so that when $\tau < 0$, we can substitute $-\tau$
into the above equation to get $R_{XX}(t_1, t_2) = a^2 e^{+2\lambda\tau}$. Thus overall we have, valid for all
interval lengths $\tau$,

$$R_{XX}(t_1, t_2) = a^2 e^{-2\lambda|\tau|}.$$


A plot of this correlation function is shown in Figure 9.2-5. 
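
The exponential correlation of the RTS can be reproduced by counting sign switches. The sketch below is illustrative: it relies on the memoryless property to draw the first switching time after $t_1$ as exponential, and it borrows a = 2.0 and λ = 0.25 from the caption of Figure 9.2-5; the function names are hypothetical.

```python
import math
import random


def rts_correlation_mc(tau, a=2.0, lam=0.25, trials=100_000, seed=0):
    """Monte Carlo estimate of E[X(t1) X(t1 + tau)] for the random telegraph signal."""
    rng = random.Random(seed)
    span = abs(tau)
    acc = 0.0
    for _ in range(trials):
        x1 = rng.choice((-a, a))          # level at the earlier time
        flips = 0
        elapsed = rng.expovariate(lam)    # time to next switch (memoryless property)
        while elapsed <= span:
            flips += 1
            elapsed += rng.expovariate(lam)
        x2 = x1 if flips % 2 == 0 else -x1
        acc += x1 * x2
    return acc / trials


def rts_correlation_theory(tau, a=2.0, lam=0.25):
    """Closed form a^2 exp(-2 lambda |tau|)."""
    return a * a * math.exp(-2.0 * lam * abs(tau))


if __name__ == "__main__":
    for tau in (0.0, 1.0, 2.0, -4.0):
        print(tau, rts_correlation_mc(tau), rts_correlation_theory(tau))
```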


Digital Modulation Using Phase-Shift Keying 


Digital computers generate many binary sequences (data) to be communicated to other 
digital computers. Often this involves some kind of modulation. Binary modulation methods 
frequency-shift these data to a region of the electromagnetic spectrum which is well suited 
to the transmission media, for example, a telephone line. A basic method for modulating 
binary data is phase-shift keying (PSK). In this method binary data, modeled by the random 
sequence B[n], are mapped bit-by-bit into a phase-angle sequence $\Theta[n]$, which is used to
modulate a carrier signal $\cos(2\pi f_c t)$.


Figure 9.2-5 The symmetric exponential correlation function of an RTS process (a = 2.0, $\lambda$ = 0.25),
plotted against $\tau = t_2 - t_1$.

[Block diagram: B[n], angle generator, $\Theta[n]$, cosine generator, X(t)]

Figure 9.2-6 System for PSK modulation of Bernoulli random sequence B[n].








Specifically let B[n] be a Bernoulli random sequence taking on the values 0 and 1 with
equal probability. Then define the random phase sequence $\Theta[n]$ as follows:

$$\Theta[n] \triangleq \begin{cases} +\pi/2 & \text{if } B[n] = 1, \\ -\pi/2 & \text{if } B[n] = 0. \end{cases}$$

Using $\Theta_a(t)$ to denote the analog angle process, we define

$$\Theta_a(t) \triangleq \Theta[k] \quad \text{for } kT \le t < (k + 1)T,$$

and construct the modulated signal as

$$X(t) = \cos(2\pi f_c t + \Theta_a(t)). \tag{9.2-11}$$


Here T is a constant time for the transmission of one bit. Normally, T is chosen to be 
a multiple of $1/f_c$ so that there are an integral number of carrier cycles per bit time T.
The reciprocal of T is called the message or baud rate. The overall modulator is shown in 
Figure 9.2-6. The process X(t) is the PSK process. 

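Our goal here is to evaluate the mean function and correlation function of the random
PSK process. To help in the calculation we define two basis functions,

$$s_I(t) \triangleq \begin{cases} \cos(2\pi f_c t) & 0 \le t < T \\ 0 & \text{else,} \end{cases}$$

$$s_Q(t) \triangleq \begin{cases} \sin(2\pi f_c t) & 0 \le t < T \\ 0 & \text{else,} \end{cases}$$

which together with Equation 9.2-11 imply

$$\cos[2\pi f_c t + \Theta_a(t)] = \cos(\Theta_a(t))\cos 2\pi f_c t - \sin(\Theta_a(t))\sin 2\pi f_c t$$

$$= \sum_{k=-\infty}^{+\infty} \cos(\Theta[k])\, s_I(t - kT) - \sum_{k=-\infty}^{+\infty} \sin(\Theta[k])\, s_Q(t - kT), \tag{9.2-12}$$

by use of the sum of angles formula for cosines.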

The mean of $X(t)$ can then be obtained in terms of the means of the random sequences
$\cos(\Theta[n])$ and $\sin(\Theta[n])$. Because of the definition of $\Theta[n]$, in this particular case
$\cos(\Theta[n]) = 0$ and $\sin(\Theta[n]) = \pm 1$ with equal probability, so that the mean of $X(t)$ is
zero, that is, $\mu_X(t) = 0$.

Using Equation 9.2-12 we can calculate the correlation function
$$R_{XX}(t_1, t_2) = \sum_{k,l} E\{\sin\Theta[k]\sin\Theta[l]\}\, s_Q(t_1 - kT)\, s_Q(t_2 - lT),$$
which involves the correlation function of the random sequence $\sin(\Theta[n])$,
$$R_{\sin\Theta,\sin\Theta}[k, l] = \delta[k - l].$$

Thus the overall correlation function becomes
$$R_{XX}(t_1, t_2) = \sum_{k=-\infty}^{+\infty} s_Q(t_1 - kT)\, s_Q(t_2 - kT). \tag{9.2-13}$$
Since the support of $s_Q$ is only of width $T$, there is no overlap in $(t_1, t_2)$ between product
terms in Equation 9.2-13. So for any fixed $(t_1, t_2)$, at most one of the product terms in the sum
can be nonzero. Also, if $t_1$ and $t_2$ are not in the same bit period, then every term is zero.
More elegantly, using the notation
$$(t) \triangleq t \bmod T \quad \text{and} \quad \lfloor t/T \rfloor \triangleq \text{integer part}\,(t/T),$$
we can write that
$$R_{XX}(t_1, t_2) = \begin{cases} s_Q((t_1))\, s_Q((t_2)) & \text{for } \lfloor t_1/T \rfloor = \lfloor t_2/T \rfloor, \\ 0 & \text{else.} \end{cases}$$

In particular, for $0 \le t_1 < T$ and $0 \le t_2 < T$, we have
$$R_{XX}(t_1, t_2) = s_Q(t_1)\, s_Q(t_2).$$
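The closed form for the PSK correlation function can be checked numerically. The following Python sketch is only an illustration: the function names and the values T = 1 and f_c = 2 are assumptions made here, not part of the text. It computes E[X(t1)X(t2)] exactly by averaging over all equally likely bit patterns in the bit periods involved, and compares the result with the mod-T expression.

```python
import itertools
import math


def s_q(t, T, fc):
    """Quadrature basis function: sin(2*pi*fc*t) on [0, T), zero elsewhere."""
    return math.sin(2.0 * math.pi * fc * t) if 0.0 <= t < T else 0.0


def r_xx_closed_form(t1, t2, T, fc):
    """s_Q((t1)) * s_Q((t2)) if both times share a bit period, else 0."""
    if math.floor(t1 / T) != math.floor(t2 / T):
        return 0.0
    return s_q(t1 % T, T, fc) * s_q(t2 % T, T, fc)


def psk_value(t, bit, fc):
    """X(t) for a given bit: phase +pi/2 if bit == 1, -pi/2 if bit == 0."""
    theta = math.pi / 2.0 if bit == 1 else -math.pi / 2.0
    return math.cos(2.0 * math.pi * fc * t + theta)


def r_xx_by_enumeration(t1, t2, T, fc):
    """E[X(t1) X(t2)], averaging over all equally likely bit patterns."""
    k1, k2 = math.floor(t1 / T), math.floor(t2 / T)
    slots = sorted({k1, k2})
    total = 0.0
    for pattern in itertools.product((0, 1), repeat=len(slots)):
        bits = dict(zip(slots, pattern))
        total += psk_value(t1, bits[k1], fc) * psk_value(t2, bits[k2], fc)
    return total / 2 ** len(slots)


if __name__ == "__main__":
    T, fc = 1.0, 2.0  # two carrier cycles per bit (illustrative values)
    for t1, t2 in [(0.10, 0.30), (0.10, 1.30), (2.40, 2.90)]:
        print(t1, t2, r_xx_by_enumeration(t1, t2, T, fc),
              r_xx_closed_form(t1, t2, T, fc))
```

The agreement relies on T being an integer multiple of 1/f_c, as assumed in the text; otherwise the carrier phase would differ from one bit period to the next.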


Wiener Process or Brownian Motion 


In Chapter 8 we considered a random sequence $X[n]$ called the random walk in
Example 8.1-13. Here we construct an analogous random process that is piecewise constant
for intervals of length $T$ as follows:

$$X_T(t) \triangleq \sum_{k=1}^{\infty} W[k]\, u(t - kT),$$
where
$$W[k] \triangleq \begin{cases} +s & \text{with } p = 0.5 \\ -s & \text{with } p = 0.5 \end{cases}$$
and $u(t)$ is the continuous unit step function.
Then $X_T(nT) = X[n]$, the random-walk sequence, since

$$X_T(nT) = \sum_{k=1}^{n} W[k] = X[n].$$

Hence we can evaluate the PMFs and moments of this random process by employing the
known results for the corresponding random-walk sequence. Now the Wiener† process,
sometimes also called the Wiener-Lévy process or Brownian motion, is the process whose distribution
is obtained as a limiting form of the distribution of the above piecewise constant process as
the interval $T$ shrinks to zero. We let $s$, the jump size, and the interval $T$ shrink to zero in
a precise way to obtain a continuous random process in the limit, that is, a process whose
sample functions are continuous functions of time. In letting $s$ and $T$ tend to zero we must
be careful to make sure that the limit of the variance stays finite and nonzero. The resulting
Wiener process will inherit the independent-increments property.

† After Norbert Wiener, American mathematician (1894-1964), a pioneer in communication and
estimation theories.

The original motivation for the Wiener process was to develop a model for the chaotic 
random motion of gas molecules. Modeling the basic discrete collisions with a random walk, 
one then finds the asymptotic process when an infinite (very large) number of molecules 
interact on an infinitesimal (very small) time scale. 

As in Example 8.1-13, we let $n$ be the number of trials, $k$ be the number of successes,
and $n - k$ be the number of failures. Also $r \triangleq k - (n - k) = 2k - n$ denotes the excess
number of successes over failures. Then $2k = n + r$ or $k = (n + r)/2$, which must be an
integer; you cannot have 2.5 “successes.” Thus, $n + r$ must be even and the probability that
$X_T(nT) = rs$ is the probability that there are $0.5(n + r)$ successes ($+s$) and $0.5(n - r)$
failures ($-s$) out of a total of $n$ trials. Thus by the binomial PMF,

$$P[X_T(nT) = rs] = \binom{n}{\frac{n+r}{2}}\, 2^{-n} \quad \text{for } n + r \text{ even.}$$

If $n + r$ is odd, then $X_T(nT)$ cannot equal $rs$.
The mean and variance can be most easily calculated by noting that the random variable
$X[n]$ is the sum of $n$ independent Bernoulli random variables defined in Section 8.1. Thus

$$E[X_T(nT)] = 0$$

and
$$E[X_T^2(nT)] = ns^2.$$

On expressing the variance in terms of $t = nT$, we have
$$\mathrm{Var}[X_T(t)] = E[X_T^2(nT)] = t\,\frac{s^2}{T}.$$


Thus we need $s^2$ proportional to $T$ to get an interesting limiting distribution.‡ We set
$s^2 = \alpha T$, where $\alpha > 0$. Now as $T$ goes to zero we keep the variance constant at $\alpha t$. Also, by
an elementary application of the Central Limit theorem (cf. Section 4.7), we get a limiting
Gaussian distribution. We take the limiting random process (convergence in the distribution
sense) to be an independent-increments process since all the above random-walk processes
had independent increments for all $T$, no matter how small. Hence we arrive at the following
specification for the limiting process, which is termed the Wiener process:

$$\mu_X(t) = 0, \qquad \mathrm{Var}[X(t)] = \alpha t$$
and
$$f_X(x; t) = \frac{1}{\sqrt{2\pi\alpha t}} \exp\left(-\frac{x^2}{2\alpha t}\right), \quad t > 0. \tag{9.2-14}$$
The pdf of the increment $\Delta \triangleq X(t) - X(\tau)$ for all $t > \tau$ is given as
$$f_\Delta(\delta; t - \tau) = \frac{1}{\sqrt{2\pi\alpha(t - \tau)}} \exp\left(-\frac{\delta^2}{2\alpha(t - \tau)}\right), \tag{9.2-15}$$
since
$$E[X(t) - X(\tau)] = E[\Delta] = 0, \tag{9.2-16}$$
and
$$E\left[(X(t) - X(\tau))^2\right] = \alpha(t - \tau) \quad \text{for } t > \tau. \tag{9.2-17}$$

‡ The physical implication of having $s^2$ proportional to $T$ is that if we take $v \triangleq s/T$ as the speed of
the particle, then the particle speed goes to infinity as the displacement $s$ goes to zero such as to keep the
product of the two constant.
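The scaling $s^2 = \alpha T$ can be checked directly from the binomial PMF of the random walk. The Python sketch below is a hedged illustration (the helper names and the values of alpha and t are assumptions, not from the text): for a fixed time t = nT it refines T and confirms that the mean stays at zero while the variance stays at alpha*t.

```python
import math


def random_walk_pmf(n, s):
    """Return {r*s: P[X_T(nT) = r*s]} from the binomial PMF with p = 0.5."""
    pmf = {}
    for k in range(n + 1):          # k = number of +s steps
        r = 2 * k - n               # excess of +s steps over -s steps
        pmf[r * s] = math.comb(n, k) * 0.5 ** n
    return pmf


def mean_and_variance(pmf):
    """Mean and variance of a PMF given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


if __name__ == "__main__":
    alpha, t = 1.0, 1.0             # illustrative values, not from the text
    for n in (10, 100, 1000):
        T = t / n                   # step interval shrinks as n grows
        s = math.sqrt(alpha * T)    # jump size chosen so that s**2 = alpha*T
        mean, var = mean_and_variance(random_walk_pmf(n, s))
        print(n, mean, var)         # variance should stay near alpha*t
```

Only the first two moments are checked here; the Gaussian shape of the limit comes from the Central Limit theorem argument in the text.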


Example 9.2-3
(sample functions) We can use MATLAB to visually investigate the sample functions typical
of the Wiener process. Since it is a computer simulation, we also can evaluate the effect of
the limiting sequence occurring as $s = \sqrt{\alpha T}$ approaches 0 for fixed $\alpha > 0$.

We start with a 1000-element vector that is a realization of the Bernoulli random vector
W with p = 0.5 generated as

    u = rand(1000,1)      % 1000 independent samples, uniform on (0,1)
    w = 0.5 >= u          % 1 with probability 0.5, otherwise 0

The following line then converts the range of w to ±s for a prespecified value s:

    w = s*(2*w - 1.0)     % map {0,1} to {-s,+s}

and then we generate a segment of a sample function of $X_T(nT) = X[n]$ as elements of the
random vector

    x = cumsum(w)         % running sum gives the random walk

For the numerical experiment let $\alpha = 1.0$ and set $T = 0.01$ ($s = 0.1$). Using a computer
variable x with dimension 1000 for $T = 0.01$, we get the results shown in Figure 9.2-7. Note
particularly, in this near-limiting case, the effects of increasing variance with time. Also note
that trends or long-term waves appear to develop as time progresses.


From the first-order pdf of X and the density of the increment A, it is possible to 
calculate a complete set of consistent nth-order pdf’s as we have seen before. It thus follows 
that all nth-order pdf’s of a Wiener process are Gaussian. 


Definition 9.2-3 If for all positive integers n, the nth-order pdf’s of a random process
are all jointly Gaussian, then the process is called a Gaussian random process.


The Wiener process is thus an example of a Gaussian random process. The covariance
function of the Wiener process (which is also its correlation function because $\mu_X(t) = 0$) is
given as

$$K_{XX}(t_1, t_2) = \alpha \min(t_1, t_2), \quad \alpha > 0. \tag{9.2-18}$$

Figure 9.2-7 A Wiener process sample function approximation for $\alpha = 1$ calculated with $T = 0.01$.

To show this we take $t_1 \ge t_2$, and noting that the (forward) increment $X(t_1) - X(t_2)$ is
independent of $X(t_2)$ and that they both have zero mean, we have

$$E[(X(t_1) - X(t_2))\, X(t_2)] = E[X(t_1) - X(t_2)]\, E[X(t_2)] = 0,$$
so that
$$E[X(t_1) X(t_2)] = E[X^2(t_2)] = \alpha t_2.$$

If $t_2 > t_1$, we get $E[X(t_2) X(t_1)] = \alpha t_1$, thus establishing Equation 9.2-18.
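The same covariance structure already holds for the underlying random walk, which gives a quick sanity check on Equation 9.2-18. The Python sketch below is illustrative only: the function name and the values of alpha and T are assumptions. It enumerates every equally likely ±s path and averages X[n1]X[n2], which should equal s²·min(n1, n2) = α·min(t1, t2).

```python
import itertools


def walk_covariance(n1, n2, s):
    """E[X[n1] X[n2]] for the +/-s random walk, averaged over all 2**n paths."""
    n = max(n1, n2)
    total = 0.0
    for steps in itertools.product((+s, -s), repeat=n):
        x1 = sum(steps[:n1])        # X[n1] = W[1] + ... + W[n1]
        x2 = sum(steps[:n2])        # X[n2] = W[1] + ... + W[n2]
        total += x1 * x2
    return total / 2 ** n


if __name__ == "__main__":
    alpha, T = 1.0, 0.25            # illustrative values; s**2 = alpha*T
    s = (alpha * T) ** 0.5
    for n1, n2 in [(3, 8), (8, 3), (5, 5)]:
        t1, t2 = n1 * T, n2 * T
        print(t1, t2, walk_covariance(n1, n2, s), alpha * min(t1, t2))
```

Exhaustive enumeration is practical only for a small number of steps; it is used here because it gives the expectation exactly rather than as a Monte Carlo estimate.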

Note that the Wiener process has the same variance function as the Poisson process, 
even though the two processes are dramatically different. While the Poisson process consists 
solely of jumps separated by constant values, the Wiener process has no jumps and can in 
fact be proven to be a.s. continuous; that is, the sample functions are continuous with 
probability 1. Later, we will show that the Wiener process is continuous in a weaker
mean-square sense (specified more precisely in Chapter 10).


Markov Random Processes 


We have discussed five random processes thus far. Of these, the Wiener and Poisson are 
fundamental in that many other rather general random processes have been shown to be 
obtainable by nonlinear transformations on these two basic processes. In both cases, the 
difficulty of specifying a consistent set of nth-order distributions from processes with
dependence was overcome by use of the independent-increments property. In fact, this is quite a
general approach in that we can start out with some arbitrary first-order distribution and
then specify a distribution for the increment, thereby obtaining a consistent set of nth-order
distributions that exhibit dependence.

Another way of going from the first-order probability to a consistent set of nth-order 
probabilities, which has proved quite useful, is the Markov process approach. Here we start 
with a first-order density (or PMF) and a conditional density (or conditional PMF) 


$$f_X(x; t) \quad \text{and} \quad f_X(x_2 | x_1; t_2, t_1), \quad t_2 > t_1,$$

and then build up the nth-order pdf $f(x_1, \ldots, x_n; t_1, \ldots, t_n)$ (or PMF) as the product,
$$f(x_1; t_1)\, f(x_2 | x_1; t_2, t_1) \cdots f(x_n | x_{n-1}; t_n, t_{n-1}). \tag{9.2-19}$$


We ask the reader to show that this is a valid nth-order pdf (i.e., that this function is 
nonnegative and integrates to one) whenever the conditional and first-order pdf’s are well 
defined. 

Conversely, if we start with an arbitrary nth-order pdf and repeatedly use the definition 
of conditional probability we obtain, 


$$f(x_1, \ldots, x_n; t_1, \ldots, t_n) = f(x_1; t_1)\, f(x_2 | x_1; t_2, t_1)\, f(x_3 | x_2, x_1; t_3, t_2, t_1) \times \cdots \times f(x_n | x_{n-1}, \ldots, x_1; t_n, \ldots, t_1), \tag{9.2-20}$$


which can be made equivalent to Equation 9.2-19 by constraining the conditional densities to 
depend only on the most recent conditioning value. This motivates the following definition 
of a Markov random process. 


Definition 9.2-4 (Markov random process)

(a) A continuous-valued (first-order) Markov process $X(t)$ satisfies the conditional
pdf expression

$$f_X(x_n | x_{n-1}, x_{n-2}, \ldots, x_1; t_n, \ldots, t_1) = f_X(x_n | x_{n-1}; t_n, t_{n-1}),$$

for all $x_1, x_2, \ldots, x_n$, for all $t_1 < t_2 < \ldots < t_n$, and for all integers $n > 0$.
(b) A discrete-valued (first-order) Markov random process satisfies the conditional PMF
expression

$$P_X(x_n | x_{n-1}, \ldots, x_1; t_n, \ldots, t_1) = P_X(x_n | x_{n-1}; t_n, t_{n-1})$$
for all $x_1, \ldots, x_n$, for all $t_1 < \ldots < t_n$, and for all integers $n > 0$.


The value of the process $X(t)$ at a given time $t$ thus determines the conditional
probabilities for future values of the process. The values of the process are called the states of the
process, and the conditional probabilities are thought of as transition probabilities between
the states. If only a finite or countable set of values $x_i$ is allowed, the discrete-valued Markov
process is called a Markov chain. An example of a Markov chain is the Poisson counting
process studied earlier. The Wiener process is an example of a continuous-valued Markov
process. Both these processes are Markov because of their independent-increments property.
In fact, any independent-increments process is also Markov. To see this note that, for the
discrete-valued case, for example,

$$\begin{aligned}
P_X(x_n | x_{n-1}, \ldots, x_1; t_n, \ldots, t_1)
&= P[X(t_n) = x_n \mid X(t_{n-1}) = x_{n-1}, \ldots, X(t_1) = x_1] \\
&= P[X(t_n) - X(t_{n-1}) = x_n - x_{n-1} \mid X(t_{n-1}) = x_{n-1}, \ldots, X(t_1) = x_1] \\
&= P[X(t_n) - X(t_{n-1}) = x_n - x_{n-1}] \quad \text{by the independent-increments property} \\
&= P[X(t_n) - X(t_{n-1}) = x_n - x_{n-1} \mid X(t_{n-1}) = x_{n-1}] \quad \text{again by independent increments} \\
&= P[X(t_n) = x_n \mid X(t_{n-1}) = x_{n-1}] \\
&= P_X(x_n | x_{n-1}; t_n, t_{n-1}).
\end{aligned}$$


Note, however, that the converse is not true: a Markov random process does not
necessarily have independent increments. (See Problem 9.17.)

Markov random processes find application in many areas including signal processing, 
communications, and control systems. Markov chains are used in communications, computer 
networks, and reliability theory. 


Example 9.2-4
(multiprocessor reliability) Given a computer with two independent processors, we can
model it as a three-state system: 0, both processors down; 1, exactly one processor up;
and 2, both processors up. We would like to know the probabilities of these three states. A
common probabilistic model is that the processors will fail randomly with time-to-failure,
the failure time, exponentially distributed with some parameter $\lambda > 0$. Once a processor
fails, the time to service it, the service time, will be assumed to be also exponentially
distributed with parameter $\mu > 0$. Furthermore, we assume that the processors’ failures and
servicing are independent; thus we make the failure and service times in our probabilistic
model jointly independent.

If we define $X(t)$ as the state of the system at time $t$, then $X$ is a continuous-time
Markov chain. We can show this by first showing that the times between state transitions
of $X$ are exponentially distributed and then invoking the memoryless property of the
exponential distribution (see Problem 9.8). Analyzing the transition times (either failure times
or service times), we proceed as follows. The transition time for going from state $X = 0$ to
$X = 1$ is the minimum of two exponentially distributed service times, which are assumed
to be independent. By Problem 3.26, this time will also be exponentially distributed with
parameter $2\mu$. The expected time for this transition will thus be $1/(2\mu) = \frac{1}{2}(1/\mu)$, that
is, one-half the average time to service a single processor. This is quite reasonable since
both processors are down in state $X = 0$ and hence both are being serviced independently
and simultaneously. The rate parameter for the transition 0 to 1 is thus $2\mu$. The
transition 1 to 2 awaits one exponential service time at rate $\mu$. Thus its rate is $\mu$.
Similarly, the state transition 1 to 0 awaits only one failure at rate $\lambda$, while the transition
2 to 1 awaits the minimum of two exponentially distributed failure times. Thus its rate
is $2\lambda$. Simultaneous transitions from 0 to 2 and 2 to 0 are of probability 0 and hence are
ignored.

Figure 9.2-8 Short-time state-transition diagram with indicated transition probabilities.

This Markov chain model is summarized in the short-time state-transition diagram of
Figure 9.2-8. In this diagram the directed branches represent short-time, that is, as $\Delta t \to 0$,
transition probabilities between the states. The transition times are assumed to be
exponentially distributed with the parameter given by the branch label. These transition times
might be more properly called intertransition times and are analogous to the interarrival
times of the Poisson counting process, which are also exponentially distributed.

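Consider the probability of being in state 2 at $t + \Delta t$, having been in state 1 at time $t$.
This requires that the service time $T_s$ lies in the interval $(t, t + \Delta t]$ conditional on $T_s > t$.
Let $P_i(t) \triangleq P[X(t) = i]$ for $0 \le i \le 2$. Then, considering this transition alone,
$$P_2(t + \Delta t) = P_1(t)\, P[t < T_s \le t + \Delta t \mid T_s > t],$$
where
$$P[t < T_s \le t + \Delta t \mid T_s > t] = \frac{F_{T_s}(t + \Delta t) - F_{T_s}(t)}{1 - F_{T_s}(t)} = \mu \Delta t + o(\Delta t).$$
Using this type of argument for connecting the probability of transitions from states at time
$t$ to states at time $t + \Delta t$, and ignoring transitions from state 2 to state 0 and vice versa,
enables us to write the state probability at time $t + \Delta t$ in terms of the state probability at
$t$ in vector-matrix form:

$$\begin{bmatrix} P_0(t + \Delta t) \\ P_1(t + \Delta t) \\ P_2(t + \Delta t) \end{bmatrix} =
\begin{bmatrix} 1 - 2\mu\Delta t & \lambda\Delta t & 0 \\ 2\mu\Delta t & 1 - (\lambda + \mu)\Delta t & 2\lambda\Delta t \\ 0 & \mu\Delta t & 1 - 2\lambda\Delta t \end{bmatrix}
\begin{bmatrix} P_0(t) \\ P_1(t) \\ P_2(t) \end{bmatrix} + o(\Delta t),$$

where $o(\Delta t)$ denotes a quantity of lower order than $\Delta t$, that is, one for which
$o(\Delta t)/\Delta t \to 0$ as $\Delta t \to 0$.
Rearranging, we have

$$\begin{bmatrix} P_0(t + \Delta t) - P_0(t) \\ P_1(t + \Delta t) - P_1(t) \\ P_2(t + \Delta t) - P_2(t) \end{bmatrix} =
\begin{bmatrix} -2\mu & \lambda & 0 \\ 2\mu & -(\lambda + \mu) & 2\lambda \\ 0 & \mu & -2\lambda \end{bmatrix}
\begin{bmatrix} P_0(t) \\ P_1(t) \\ P_2(t) \end{bmatrix} \Delta t + o(\Delta t).$$
Dividing both sides by $\Delta t$, letting $\Delta t \to 0$, and using an obvious matrix notation, we obtain
$$\frac{d\mathbf{P}(t)}{dt} = \mathbf{A}\mathbf{P}(t). \tag{9.2-21}$$

The matrix $\mathbf{A}$ is called the generator of the Markov chain $X$. This first-order vector
differential equation can be solved for an initial probability vector, $\mathbf{P}(0) \triangleq \mathbf{P}_0$, using methods of
linear-system theory [9-3]. The solution is expressed in terms of the matrix exponential

$$e^{\mathbf{A}t} \triangleq \mathbf{I} + \mathbf{A}t + \frac{1}{2!}(\mathbf{A}t)^2 + \frac{1}{3!}(\mathbf{A}t)^3 + \cdots,$$
which converges for all finite $t$. The solution $\mathbf{P}(t)$ is then given as
$$\mathbf{P}(t) = e^{\mathbf{A}t}\mathbf{P}_0, \quad t \ge 0.$$

For details on this method as well as how to obtain an explicit solution, see [9-4].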

For the present we content ourselves with the steady-state solution obtained by setting
the time derivative in Equation 9.2-21 to zero, thus yielding $\mathbf{A}\mathbf{P} = \mathbf{0}$. From the first and last
rows we get

$$-2\mu P_0 + \lambda P_1 = 0$$

and
$$+\mu P_1 - 2\lambda P_2 = 0.$$

From this we obtain $P_1 = (2\mu/\lambda) P_0$ and $P_2 = (\mu/2\lambda) P_1 = (\mu/\lambda)^2 P_0$. Then invoking
$P_0 + P_1 + P_2 = 1$, we obtain $P_0 = \lambda^2/(\lambda^2 + 2\mu\lambda + \mu^2)$ and finally

$$\mathbf{P} = \frac{1}{\lambda^2 + 2\mu\lambda + \mu^2}\,[\lambda^2,\ 2\mu\lambda,\ \mu^2]^T.$$
Thus the steady-state probability of both processors being down is $P_0 = [\lambda/(\lambda + \mu)]^2$.
Incidentally, if we had used only one processor modeled by a two-state Markov chain, we
would have obtained $P_0 = \lambda/(\lambda + \mu)$.
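As a numerical cross-check of this steady-state result, the short-time update $\mathbf{P}(t + \Delta t) \approx \mathbf{P}(t) + \Delta t\, \mathbf{A}\mathbf{P}(t)$ can simply be iterated until the state probabilities stop changing. The Python sketch below is an illustration under stated assumptions: the function names, the step size, and the rates λ = 1 and μ = 3 are chosen here and do not come from the text, and the forward-Euler iteration stands in for the matrix-exponential solution.

```python
def generator(lam, mu):
    """Generator matrix A of the two-processor chain (columns sum to zero)."""
    return [
        [-2.0 * mu, lam, 0.0],
        [2.0 * mu, -(lam + mu), 2.0 * lam],
        [0.0, mu, -2.0 * lam],
    ]


def euler_step(A, P, dt):
    """One short-time update P(t + dt) = P(t) + dt * A P(t)."""
    return [P[i] + dt * sum(A[i][j] * P[j] for j in range(3)) for i in range(3)]


def steady_state_by_iteration(lam, mu, dt=1e-3, steps=20_000):
    """Iterate the short-time update, starting with both processors up."""
    A = generator(lam, mu)
    P = [0.0, 0.0, 1.0]
    for _ in range(steps):
        P = euler_step(A, P, dt)
    return P


def steady_state_closed_form(lam, mu):
    """[lam^2, 2*lam*mu, mu^2] / (lam + mu)^2, as derived in the text."""
    denom = (lam + mu) ** 2
    return [lam ** 2 / denom, 2.0 * lam * mu / denom, mu ** 2 / denom]


if __name__ == "__main__":
    lam, mu = 1.0, 3.0              # hypothetical failure and service rates
    print(steady_state_by_iteration(lam, mu))
    print(steady_state_closed_form(lam, mu))
```

The step size must be small compared with the reciprocal of the largest rate in A for the iteration to remain stable; the values used here satisfy that with a wide margin.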





Clearly we can generalize this example to any number of states n with independent expo- 
nential interarrival times between these states. In fact, such a process is called a queueing 
process. Other examples are the number of toll booths busy on a superhighway and conges- 
tion states in a computer or telephone network. For more on queueing systems, see [9-2]. 
An important point to notice in the last example is that the exponential transition times 
were crucial in showing the Markov property. In fact, any other distribution but exponential 
would not be memoryless, and the resulting state-transition process would not be a Markov 
chain. 


Birth-Death Markov Chains 


A Markov chain in which transitions are permissible only between adjacent states is called
a birth-death chain. We first deal with the case where the number of states is infinite and
afterwards treat the finite-state case.

Figure 9.2-9 Markov state diagram for the birth-death process showing transition rate parameters: birth rates $\lambda_i$ on the links from state $i$ to state $i+1$ and death rates $\mu_{i+1}$ on the links from state $i+1$ back to state $i$.


1. Infinite-length queues. The state-transition diagram for the infinite-length queue is
shown in Figure 9.2-9.† In going from state $i$ to state $i+1$, we say that a birth has occurred.
Likewise, in going from state $i$ to state $i-1$ we say a death has occurred. At any time $t$, $P_j(t)$
is the probability of being in state $j$, that is, of having a “population” of size $j$, in other
words the excess of the number of births over deaths. In this model, births are generated by
a Poisson process. The times between births $T_B$ and the times between deaths $T_D$ depend
on the states but obey the exponential distribution with parameters $\lambda_i$ and $\mu_i$, respectively.
The model is used widely in queueing theory, where a birth is an arrival to the queue and a
death is a departure of one from the queue because of the completion of service. An example
is people waiting in line to purchase a ticket at a single-server ticket booth. If the theater is
very large and there are no restrictions on the length of the queue (e.g., the queue may block
the sidewalk and create a hazard), overflow and saturation can be disregarded. Then the
dynamics of the queue are described by the basic equation $W_n = \max\{0, W_{n-1} + T_s - T_i\}$,
where $W_n$ is the waiting time in the queue for the nth arrival, $T_s$ is the service time for
the (n − 1)st arrival, and $T_i$ is the interarrival time between the nth and (n − 1)st arrivals.
This is an example of unrestricted queue length. On the other hand, data packets stored in a
finite-size buffer memory present a different problem. When the buffer is filled (saturation),
a new arrival must be turned away (in this case we say the data packet is “lost”).
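Following the procedure in Example 9.2-4, we can write that

$$\mathbf{P}(t + \Delta t) = \mathbf{B}\mathbf{P}(t),$$

where
$$\mathbf{B} = \begin{bmatrix}
1 - \lambda_0\Delta t & \mu_1\Delta t & 0 & 0 & \cdots \\
\lambda_0\Delta t & 1 - (\lambda_1 + \mu_1)\Delta t & \mu_2\Delta t & 0 & \cdots \\
0 & \lambda_1\Delta t & 1 - (\lambda_2 + \mu_2)\Delta t & \mu_3\Delta t & \cdots \\
\vdots & & & & \ddots
\end{bmatrix}.$$

Rearranging, dividing by $\Delta t$, and letting $\Delta t \to 0$, we get
$$d\mathbf{P}(t)/dt = \mathbf{A}\mathbf{P}(t),$$
where $\mathbf{P}(t) = [P_0(t), P_1(t), \ldots, P_j(t), \ldots]^T$, and $\mathbf{A}$, the generator matrix for the Markov
chain, is given by

$$\mathbf{A} = \begin{bmatrix}
-\lambda_0 & \mu_1 & 0 & 0 & \cdots \\
\lambda_0 & -(\lambda_1 + \mu_1) & \mu_2 & 0 & \cdots \\
0 & \lambda_1 & -(\lambda_2 + \mu_2) & \mu_3 & \cdots \\
\vdots & & & & \ddots
\end{bmatrix}.$$

† In keeping with standard practice, we draw the diagram showing only the transition rate parameters,
that is, the $\mu_i$’s and $\lambda_i$’s, over the links between states. This type of diagram does not show explicitly, for
example, that in the Poisson case the short-time probability of staying in state $i$ is $1 - (\lambda_i + \mu_i)\Delta t$. While this
type of diagram is less clear, it is less crowded than, say, the nonstandard short-time transition probability
diagram in Figure 9.2-8.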


In the steady state $\mathbf{P}'(t) = \mathbf{0}$. Thus, we obtain from $\mathbf{A}\mathbf{P} = \mathbf{0}$,
$$P_1 = \rho_1 P_0,$$
$$P_2 = \rho_2 P_1 = \rho_2 \rho_1 P_0,$$
$$\vdots$$
$$P_j = \rho_j P_{j-1} = \rho_j \cdots \rho_2 \rho_1 P_0,$$

where $\rho_j \triangleq \lambda_{j-1}/\mu_j$, for $j \ge 1$.

Assuming that the series converges, we require that $\sum_{i=0}^{\infty} P_i = 1$. With the notation
$r_j \triangleq \rho_j \cdots \rho_2 \rho_1$, and $r_0 \triangleq 1$, this means $P_0 \sum_{i=0}^{\infty} r_i = 1$ or $P_0 = 1/\sum_{i=0}^{\infty} r_i$. Hence the
steady-state probabilities for the birth-death Markov chain are given by

$$P_j = r_j \Big/ \sum_{i=0}^{\infty} r_i, \quad j \ge 0.$$

Failure of the denominator to converge implies that there is no steady state and therefore
the steady-state probabilities are zero. This model is often called the M/M/1 queue.


2. M/M/1 Queue with constant birth and death parameters and finite storage L.
Here we assume that $\lambda_i = \lambda$ and $\mu_i = \mu$, for all $i$, and that the queue length cannot
exceed $L$. This stochastic model can apply to the analysis of a finite buffer as shown in
Figure 9.2-10. The dynamical equations are

$$\begin{aligned}
dP_0(t)/dt &= -\lambda P_0(t) + \mu P_1(t) \\
dP_1(t)/dt &= +\lambda P_0(t) - (\lambda + \mu) P_1(t) + \mu P_2(t) \\
&\ \ \vdots \\
dP_L(t)/dt &= +\lambda P_{L-1}(t) - \mu P_L(t).
\end{aligned}$$

Note that the first and last equations contain only two terms, since a death cannot occur in
an empty queue and a birth cannot occur when the queue has its maximum size $L$. From
these equations, we easily obtain that the steady-state solution is $P_i = \rho^i P_0$, for $0 \le i \le L$,
where $\rho \triangleq \lambda/\mu$. From the condition that the buffer must be in some state, we obtain that
$\sum_{i=0}^{L} \rho^i P_0 = 1$, or that $P_0 = (1 - \rho)/(1 - \rho^{L+1})$. Saturation occurs when the buffer is full.
The steady-state probability of this event is $P_L = \rho^L (1 - \rho)/(1 - \rho^{L+1})$. Thus for a birth
rate that is half the death rate, and a buffer of size 10, the probability of saturation is
approximately $5 \times 10^{-4}$.

Figure 9.2-10 Illustration of packet arriving at buffer of finite size L.
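The product formula for the steady-state probabilities is easy to evaluate numerically for a finite chain. The Python sketch below is a minimal illustration, with function names and the rate values (λ = 1, μ = 2, L = 10) assumed here to mirror the case "birth rate half the death rate, buffer of size 10." It builds the ratios $r_j$ recursively, normalizes them, and compares the last state probability with the closed form for $P_L$.

```python
def birth_death_steady_state(birth_rates, death_rates):
    """Steady-state probabilities P_0..P_L of a finite birth-death chain.

    birth_rates[j] is lambda_j (rate j -> j+1) and death_rates[j] is mu_{j+1}
    (rate j+1 -> j), for j = 0, ..., L-1.
    """
    r = [1.0]                                   # r_0 = 1
    for lam, mu in zip(birth_rates, death_rates):
        r.append(r[-1] * lam / mu)              # r_j = rho_j * r_{j-1}
    total = sum(r)
    return [rj / total for rj in r]


def saturation_probability(rho, L):
    """Closed form P_L = rho^L (1 - rho) / (1 - rho^(L+1)), for rho != 1."""
    return rho ** L * (1.0 - rho) / (1.0 - rho ** (L + 1))


if __name__ == "__main__":
    L, lam, mu = 10, 1.0, 2.0                   # illustrative values
    P = birth_death_steady_state([lam] * L, [mu] * L)
    print(P[-1], saturation_probability(lam / mu, L))
```

Because the recursion accepts state-dependent rates, the same helper also covers chains in which $\lambda_i$ and $\mu_i$ vary with the state, as long as the number of states is finite.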


Example 9.2-5 
(average queue size) In computer and communication networks, packet switching refers to 
the transmission of blocks of data called packets from node to node. At each node the packets 
are processed with a view toward determining the next link in the source-to-destination 
route. The arrival time of the packets, the amount of time they have to wait in a buffer, 
and the service time in the CPU (the central processing unit) are random variables. 

Assume a first-come, first-served, infinite-capacity buffer, with exponential service time
with parameter $\mu$, and Poisson-distributed arrivals with a Poisson rate parameter of $\lambda$
arrivals per unit time. We know from earlier in this section that the interarrival times of
the Poisson process are i.i.d. exponential random variables with parameter $\lambda$. The state
diagram for this case is identical to that of Figure 9.2-9 except that $\mu_1 = \mu_2 = \ldots = \mu$ and
$\lambda_0 = \lambda_1 = \ldots = \lambda$. Then specializing the results of the previous discussion to this example,
we find that $P_i = \rho^i P_0$, for $0 \le i$, where $\rho \triangleq \lambda/\mu$ and, provided $\rho < 1$, $P_0 = (1 - \rho)$. Thus, in the steady
state $P_i = \rho^i (1 - \rho)$, and the average number of packets in the queue, $E[N]$, is computed
from

$$E[N] = \sum_{i=0}^{\infty} i P_i = \frac{\rho}{1 - \rho}.$$


We leave the details of this elementary calculation as an exercise. 
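One possible route through this calculation, sketched here under the assumption $0 \le \rho < 1$ (needed for the series to converge), is to differentiate the geometric series term by term:

```latex
\begin{align*}
\sum_{i=0}^{\infty} \rho^{i} &= \frac{1}{1-\rho}, \qquad 0 \le \rho < 1, \\
\sum_{i=0}^{\infty} i\,\rho^{i}
  &= \rho\,\frac{d}{d\rho}\left(\frac{1}{1-\rho}\right)
   = \frac{\rho}{(1-\rho)^{2}}, \\
E[N] &= \sum_{i=0}^{\infty} i\,P_i
      = (1-\rho)\sum_{i=0}^{\infty} i\,\rho^{i}
      = \frac{\rho}{1-\rho}.
\end{align*}
```

For instance, $\rho = 0.5$ gives $E[N] = 1$, while $E[N]$ grows without bound as $\rho$ approaches 1.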


Example 9.2-6
(finite capacity buffer) We revisit Example 9.2-5 except that now the arriving data packets
are stored in a buffer of size $L$. Consider the following set-up: The data stored in the buffer
are processed by a CPU on a first-come, first-served basis.
Assume that, say at time $t$, the buffer is filled to capacity, and that there is a packet being
processed in the CPU and an arriving packet on the way to the buffer. If the interarrival
time $T_i$ between this packet and the previous one is less than $T_s$, the service time for the
packet in the CPU, the arriving packet will be lost. The probability of this event is

$$\begin{aligned}
P[\text{“packet loss”}] &= P[\text{“saturation”} \cap \{T_s > T_i\}] \\
&= \rho^L (1 - \rho)/(1 - \rho^{L+1}) \times P[T_s - T_i > 0],
\end{aligned}$$

since the events “saturation” and $\{T_s > T_i\}$ are independent. Since $T_s$ and $T_i$ are
independent, the probability $P[T_s - T_i > 0]$ can easily be computed by convolution. The result
is $P[T_s - T_i > 0] = \lambda/(\lambda + \mu)$. The probability of losing the incoming packet is then

$$P[\text{“packet loss”}] = \rho^L (1 - \rho)/(1 - \rho^{L+1}) \times \rho/(1 + \rho),$$

which, for $\rho = 0.5$, yields $P[\text{“packet loss”}] = 1.6 \times 10^{-4}$ for the buffer of size 10, with arrival
rate equal to half the service rate.
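The factor $\lambda/(\lambda + \mu)$ can also be obtained without an explicit convolution. The sketch below is an alternative to the convolution route named in the example, not the example's own derivation: it conditions on the interarrival time $T_i$, using the exponential densities assumed in the example, and then evaluates the loss probability for $\rho = 1/2$ and $L = 10$.

```latex
\begin{align*}
P[T_s > T_i]
  &= \int_{0}^{\infty} P[T_s > x]\, f_{T_i}(x)\, dx
   = \int_{0}^{\infty} e^{-\mu x}\, \lambda e^{-\lambda x}\, dx
   = \frac{\lambda}{\lambda+\mu}
   = \frac{\rho}{1+\rho}, \\[4pt]
P_L\Big|_{\rho = 1/2,\ L = 10}
  &= \frac{2^{-10}\cdot\tfrac{1}{2}}{1 - 2^{-11}}
   = \frac{1}{2047} \approx 4.885 \times 10^{-4}, \\[4pt]
P[\text{packet loss}]
  &= \frac{1}{2047}\cdot\frac{1/2}{3/2}
   = \frac{1}{6141} \approx 1.63 \times 10^{-4}.
\end{align*}
```

These values agree with the rounded figures $5 \times 10^{-4}$ and $1.6 \times 10^{-4}$ quoted in the text.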








Chapman-Kolmogorov Equations 


In the examples of a Markov random sequence in Chapter 8, we specified the transition 
density as a one-step transition, that is, from n — 1 to n. More generally, we can specify 
the transition density from time n to time n+ k, where k > 0, as in the general definition 
of a Markov random sequence. However, in this more general case we must make sure that 
this multistep transition density is consistent, that is, that there exists a one-step density 
that would sequentially yield the same results. This problem is even more important in the 
random process case, where due to continuous time one is always effectively considering 
multistep transition densities; that is, between any two times $t_2 > t_1$, there is a time in
between.

For example, given a continuous-time transition density $f_X(x_2 | x_1; t_2, t_1)$, how do we
know that an unconditional pdf $f_X(x; t)$ can be found to satisfy the equation

$$f_X(x_2; t_2) = \int_{-\infty}^{+\infty} f_X(x_2 | x_1; t_2, t_1)\, f_X(x_1; t_1)\, dx_1$$

for all $t_2 > t_1$, and all $x_2$?

The Chapman-Kolmogorov equations supply both necessary and sufficient conditions 
for these general transition densities. There is also a version of the Chapman—Kolmogorov 
equations for the discrete-valued case involving PMFs of multistep transitions. 

Consider three times $t_3 > t_2 > t_1$ and the Markov process random variables at these
three times $X(t_3)$, $X(t_2)$, and $X(t_1)$. We wish to compute the conditional density of $X(t_3)$
given $X(t_1)$. First, we write the joint pdf

$$f_X(x_3, x_1; t_3, t_1) = \int_{-\infty}^{+\infty} f_X(x_3 | x_2, x_1; t_3, t_2, t_1)\, f_X(x_2, x_1; t_2, t_1)\, dx_2.$$
If we now divide both sides of this equation by $f(x_1; t_1)$, we obtain

$$f_X(x_3 | x_1) = \int_{-\infty}^{+\infty} f_X(x_3 | x_2, x_1)\, f_X(x_2 | x_1)\, dx_2,$$

where we have suppressed the times $t_i$ for notational simplicity. Then using the Markov
property the above becomes

$$f_X(x_3 | x_1) = \int_{-\infty}^{+\infty} f_X(x_3 | x_2)\, f_X(x_2 | x_1)\, dx_2, \tag{9.2-22}$$

which is known as the Chapman-Kolmogorov equation for the transition density $f_X(x_3 | x_1)$
of a Markov process. This equation must hold for all $t_3 > t_2 > t_1$ and for all values of $x_3$ and
$x_1$. It can be proven that the Chapman-Kolmogorov condition expressed in Equation 9.2-22
is also sufficient for the existence of the transition density in question [9-5].
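The Wiener process offers a concrete check of Equation 9.2-22: by the increment pdf of Equation 9.2-15, its transition density is Gaussian in $x_2 - x_1$ with variance $\alpha(t_2 - t_1)$. The Python sketch below is illustrative only; the function names, the integration limits, and the numerical values are assumptions made here. It approximates the integral over $x_2$ with the trapezoidal rule and compares it with the direct transition density from $t_1$ to $t_3$.

```python
import math


def wiener_transition_pdf(x_next, x_prev, dt, alpha):
    """Gaussian transition density: increment has mean 0 and variance alpha*dt."""
    var = alpha * dt
    return (math.exp(-(x_next - x_prev) ** 2 / (2.0 * var))
            / math.sqrt(2.0 * math.pi * var))


def chapman_kolmogorov_rhs(x3, x1, t3, t2, t1, alpha, half_width=12.0, n=4000):
    """Trapezoidal approximation of the integral over x2 in Equation 9.2-22."""
    lo = -half_width
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        x2 = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0    # end points get half weight
        total += weight * (wiener_transition_pdf(x3, x2, t3 - t2, alpha)
                           * wiener_transition_pdf(x2, x1, t2 - t1, alpha))
    return total * h


if __name__ == "__main__":
    alpha, t1, t2, t3 = 1.0, 0.5, 1.2, 2.0      # illustrative values
    x1, x3 = 0.3, -0.7
    print(wiener_transition_pdf(x3, x1, t3 - t1, alpha))
    print(chapman_kolmogorov_rhs(x3, x1, t3, t2, t1, alpha))
```

The integration range must be wide enough to cover the bulk of both Gaussian factors; the half-width of 12 used here is generous for the variances in this example.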


Random Process Generated from Random Sequences 


We can obtain a Markov random process as the limit of an infinite number of simulations
of Markov random sequences. For example, consider the random sequence generated by the
equation

$$X[n] = \rho X[n - 1] + W[n], \quad -\infty < n < +\infty,$$

as given in Example 8.4-6 of Chapter 8, where $|\rho| < 1.0$ to ensure stability. There we found
that the correlation function of $X[n]$ was

$$R_{XX}[m] = \sigma_W^2\, \rho^{|m|},$$

where $\sigma_W^2$ is the variance of the independent random sequence $W[n]$. Replacing $X[n]$ with
$X(nT)$, and setting $X(t) = X(nT)$ for $nT \le t < (n + 1)T$, we get

$$R_{XX}(t + \tau, t) = \sigma_W^2\, \rho^{|\tau|/T} = \sigma_W^2 \exp(-\alpha|\tau|),$$

where $\alpha \triangleq \frac{1}{T} \ln \frac{1}{\rho}$ or alternatively $\rho = \exp(-\alpha T)$. Thus, if we generate a set of simulations
with $T_k \triangleq T_0/k$ for $k = 1, 2, 3, \ldots$, and then for each simulation set $\rho_k \triangleq \sqrt[k]{\exp(-\alpha T_0)}$, we
will get a set of denser and denser approximations to a limiting random process $X(t)$ that
is WSS with correlation function

$$R_{XX}(t + \tau, t) = \sigma_W^2 \exp(-\alpha|\tau|).$$
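The role of the choice $\rho_k = \sqrt[k]{\exp(-\alpha T_0)}$ can be seen numerically: as $k$ grows, $\rho_k$ approaches 1, yet the correlation at a fixed lag $\tau$ does not change. The Python sketch below only evaluates the correlation expressions quoted in the text. The function names and parameter values are assumptions, and the scale factor is passed in as a plain number rather than derived from a simulation.

```python
import math


def sampled_correlation(tau, k, alpha, T0, sigma2):
    """sigma2 * rho_k**(|tau|/T_k) with T_k = T0/k, rho_k = exp(-alpha*T0)**(1/k)."""
    T_k = T0 / k
    rho_k = math.exp(-alpha * T0) ** (1.0 / k)
    return sigma2 * rho_k ** (abs(tau) / T_k)


def limiting_correlation(tau, alpha, sigma2):
    """sigma2 * exp(-alpha*|tau|), the correlation of the limiting process."""
    return sigma2 * math.exp(-alpha * abs(tau))


if __name__ == "__main__":
    alpha, T0, sigma2, tau = 2.0, 0.5, 1.0, 0.37    # illustrative values
    for k in (1, 2, 5, 50):
        rho_k = math.exp(-alpha * T0) ** (1.0 / k)
        print(k, rho_k, sampled_correlation(tau, k, alpha, T0, sigma2),
              limiting_correlation(tau, alpha, sigma2))
```

This confirms the algebra behind the parameter choice only; how closely a simulated piecewise-constant process tracks the limit between sample instants is a separate question not addressed here.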


9.3 CONTINUOUS-TIME LINEAR SYSTEMS WITH RANDOM INPUTS 


In this section we look at transformations of stochastic processes. We concentrate on the
case of linear transformations with memory, since the memoryless case can be handled by
the transformation of random variables method of Chapter 3. The definition of a linear
continuous-time system is recalled first.

Definition 9.3-1 Let $x_1(t)$ and $x_2(t)$ be two deterministic time functions and let $a_1$
and $a_2$ be two scalar constants. Let the system be described by the operator equation
$y = L\{x\}$. Then the system is linear if

$$L\{a_1 x_1(t) + a_2 x_2(t)\} = a_1 L\{x_1(t)\} + a_2 L\{x_2(t)\} \tag{9.3-1}$$
for all admissible functions $x_1$ and $x_2$ and all scalars $a_1$ and $a_2$.


This amounts to saying that the response to a weighted sum of inputs must be the 
weighted sum of the responses to each one individually. Also, in this definition we note 
that the inputs must be in the allowable input space for the system (operator) L. When 
we think of generalizing L to allow a random process input, the most natural choice is to 
input the sample functions of X and find the corresponding sample functions of the output, 
which thereby define a new random process Y. Just as the original random process X is 
a mapping from the sample space to a function space, the linear system in turn maps this 
function space to a new function space. The cascade or composition of the two maps thus 
defines an output random process. This is depicted graphically in Figure 9.3-1. Our goal in 
this section will be to find out how the first- and second-order moments, that is, the mean 
and correlation (and covariance), are transformed by a linear system. 


Theorem 9.3-1 Let the random process $X(t)$ be the input to a linear system $L$ with
output process $Y(t)$. Then the mean function of the output is given as

$$E[Y(t)] = L\{E[X(t)]\} = L\{\mu_X(t)\}. \tag{9.3-2}$$

Figure 9.3-1 Interpretation of applying a random process to a linear system: each outcome $\zeta_i$ in the sample space $\Omega$ selects an input sample function $x_i(t) = X(\zeta_i, t)$, which the linear system maps to the output sample function $y_i(t) = Y(\zeta_i, t)$.

Proof (formal). By definition we have for each sample function
$$Y(t, \zeta) = L\{X(t, \zeta)\},$$
so
$$E[Y(t)] = E[L\{X(t)\}].$$

If we can interchange the two operators, we get the result that the mean function of the
output is just the result of $L$ operating on the mean function of the input. This can be
heuristically (formally) justified as follows, if we assume the operator $L$ can be represented
by the superposition integral:

$$Y(t) = \int_{-\infty}^{+\infty} h(t, \tau) X(\tau)\, d\tau.$$

Taking the expectation, we obtain

$$\begin{aligned}
E[Y(t)] &= E\left[\int_{-\infty}^{+\infty} h(t, \tau) X(\tau)\, d\tau\right] \\
&= \int_{-\infty}^{+\infty} h(t, \tau)\, E[X(\tau)]\, d\tau \\
&= L\{\mu_X(t)\}.
\end{aligned}$$


We present a rigorous proof of this theorem after we study the mean-square stochastic
integral in Chapter 10. For now, we will assume it is valid, and next look at how the
correlation function is transformed by a linear system. There are now two stochastic processes
to consider, the input and the output, and the cross-correlation function $E[X(t_1) Y^*(t_2)]$
comes into play. We thus define the cross-correlation function

$$R_{XY}(t_1, t_2) \triangleq E[X(t_1) Y^*(t_2)].$$

From the autocorrelation function of the input $R_{XX}(t_1, t_2)$, we first calculate the
cross-correlation function $R_{XY}(t_1, t_2)$ and then the autocorrelation function of the output
$R_{YY}(t_1, t_2)$. If the mean is zero for the input process, then by Theorem 9.3-1 the mean
of the output process is also zero. Thus the following results can be seen also to hold for
covariance functions by changing the input to the centered process $X_c(t) \triangleq X(t) - \mu_X(t)$,
which produces the centered output $Y_c(t) \triangleq Y(t) - \mu_Y(t)$.


Theorem 9.3-2 Let $X(t)$ and $Y(t)$ be the input and output random processes of the
linear operator $L$. Then the following hold:

$$R_{XY}(t_1, t_2) = L_2^*\{R_{XX}(t_1, t_2)\}, \tag{9.3-3}$$
$$R_{YY}(t_1, t_2) = L_1\{R_{XY}(t_1, t_2)\}, \tag{9.3-4}$$

where $L_i$ means the time variable of the operator $L$ is $t_i$.

Proof (formal). Write
$$X(t_1) Y^*(t_2) = X(t_1) L_2^*\{X^*(t_2)\} = L_2^*\{X(t_1) X^*(t_2)\},$$

where we have used the adjoint operator $L^*$ whose impulse response is $h^*(t, \tau)$, that is, the
complex conjugate of $h(t, \tau)$. Then

$$\begin{aligned}
E[X(t_1) Y^*(t_2)] &= E[L_2^*\{X(t_1) X^*(t_2)\}] \\
&= L_2^*\{E[X(t_1) X^*(t_2)]\} \quad \text{by interchanging } L_2^* \text{ and } E, \\
&= L_2^*\{R_{XX}(t_1, t_2)\},
\end{aligned}$$
which is Equation 9.3-3. Similarly, to prove Equation 9.3-4, we multiply $Y(t_1) = L_1\{X(t_1)\}$ by $Y^*(t_2)$ and get
$$Y(t_1) Y^*(t_2) = L_1\{X(t_1) Y^*(t_2)\}$$
so that
$$\begin{aligned}
E[Y(t_1) Y^*(t_2)] &= E[L_1\{X(t_1) Y^*(t_2)\}] \\
&= L_1\{E[X(t_1) Y^*(t_2)]\} \quad \text{by interchanging } L_1 \text{ and } E, \\
&= L_1\{R_{XY}(t_1, t_2)\},
\end{aligned}$$
which is Equation 9.3-4. If we combine Equation 9.3-3 and Equation 9.3-4, we get
$$R_{YY}(t_1, t_2) = L_1 L_2^*\{R_{XX}(t_1, t_2)\}. \tag{9.3-5}$$


Example 9.3-1
(edge or “change” detector) Let $X(t)$ be a real-valued random process, modeling a certain
sensor signal, and define $Y(t) \triangleq L\{X(t)\} \triangleq X(t) - X(t-1)$, so
$$E[Y(t)] = L\{\mu_X(t)\} = \mu_X(t) - \mu_X(t-1).$$

Also
$$R_{XY}(t_1, t_2) = L_2\{R_{XX}(t_1, t_2)\} = R_{XX}(t_1, t_2) - R_{XX}(t_1, t_2 - 1)$$

and
$$R_{YY}(t_1, t_2) = L_1\{R_{XY}(t_1, t_2)\} = R_{XY}(t_1, t_2) - R_{XY}(t_1 - 1, t_2)$$
$$= R_{XX}(t_1, t_2) - R_{XX}(t_1 - 1, t_2) - R_{XX}(t_1, t_2 - 1) + R_{XX}(t_1 - 1, t_2 - 1).$$


To be specific, if we take $\mu_X(t) = 0$ and

$$R_{XX}(t_1, t_2) \triangleq \sigma_X^2 \exp(-\alpha|t_1 - t_2|),$$

then
$$E[Y(t)] = 0 \quad \text{since } \mu_X = 0,$$
and
$$R_{YY}(t_1, t_2) = \sigma_X^2 \left( 2\exp(-\alpha|t_1 - t_2|) - \exp(-\alpha|t_1 - t_2 - 1|) - \exp(-\alpha|t_1 - t_2 + 1|) \right).$$

[Figure 9.3-2: Input correlation function $R_{XX}$ of Example 9.3-1 versus $\tau = t_1 - t_2$.]

We note that both $R_{XX}$ and $R_{XY}$ are functions only of the difference of the two
observation times $t_1$ and $t_2$. The input correlation function $R_{XX}$ is plotted in Figure 9.3-2
for $\alpha = 2$ and $\sigma_X^2 = 2$. Note the negative correlation values in the output correlation function
$R_{YY}$, shown in Figure 9.3-3, introduced by the difference operation of the edge detector.
The variance of $Y(t)$ is constant and is given as

$$\sigma_Y^2(t) = \sigma_Y^2 = 2\sigma_X^2[1 - \exp(-\alpha)].$$

We see that as $\alpha$ tends to zero, the variance of $Y$ goes to zero. This is because as $\alpha$ tends
to zero, $X(t)$ and $X(t-1)$ become very positively correlated, and hence there is very little
power in their difference.
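The closed-form output correlation of Example 9.3-1 can be checked numerically against the four-term expansion obtained by applying $L_1 L_2$ to $R_{XX}$. The following is a minimal sketch, assuming the parameter values $\alpha = 2$ and $\sigma_X^2 = 2$ used for the figures; the function names are illustrative and not from the text.

```python
import math

SIGMA2 = 2.0  # sigma_X^2, the value used for Figure 9.3-2
ALPHA = 2.0   # alpha, the value used for Figure 9.3-2


def r_xx(t1, t2):
    """Input autocorrelation sigma_X^2 * exp(-alpha * |t1 - t2|)."""
    return SIGMA2 * math.exp(-ALPHA * abs(t1 - t2))


def r_yy_from_operator(t1, t2):
    """Apply L1 L2 to R_XX for Y(t) = X(t) - X(t - 1) (four-term expansion)."""
    return (r_xx(t1, t2) - r_xx(t1 - 1, t2)
            - r_xx(t1, t2 - 1) + r_xx(t1 - 1, t2 - 1))


def r_yy_closed_form(tau):
    """Closed form of R_YY as a function of tau = t1 - t2."""
    return SIGMA2 * (2 * math.exp(-ALPHA * abs(tau))
                     - math.exp(-ALPHA * abs(tau - 1))
                     - math.exp(-ALPHA * abs(tau + 1)))


def var_y():
    """Output variance 2 * sigma_X^2 * (1 - exp(-alpha))."""
    return 2 * SIGMA2 * (1 - math.exp(-ALPHA))


if __name__ == "__main__":
    for tau in (0.0, 0.5, 1.0, 2.0):
        print(tau, r_yy_from_operator(tau, 0.0), r_yy_closed_form(tau))
    print("variance:", var_y())
```

Evaluating `r_yy_closed_form` near $\tau = \pm 1$ shows the negative correlation values mentioned above.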


Example 9.3-2
(derivative process) Let $X(t)$ be a real-valued random process with constant mean function
$\mu_X(t) = \mu$ and covariance function

$$K_{XX}(t, s) = \sigma^2 \cos \omega_0(t - s).$$

We wish to determine the mean and covariance function of the derivative process $X'(t)$.
Here the linear operator is $d(\cdot)/dt$. First we determine the mean,

$$\mu_{X'}(t) = E[X'(t)] = \frac{d}{dt}E[X(t)] = \frac{d}{dt}\mu_X(t) = \frac{d}{dt}\mu = 0.$$

[Figure 9.3-3: Output correlation function $R_{YY}$ of Example 9.3-1 versus $\tau = t_1 - t_2$.]

Now, for this real-valued random process, the covariance function of $X'(t)$ is
$$K_{X'X'}(t_1, t_2) = E[X'(t_1)X'(t_2)],$$
since $\mu_{X'}(t) = 0$. Thus by Equation 9.3-5, with $X'(t) = Y(t)$,

$$K_{X'X'}(t_1, t_2) = \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2} K_{XX}(t_1, t_2)\right)
= \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2}\, \sigma^2 \cos \omega_0(t_1 - t_2)\right)$$
$$= \frac{\partial}{\partial t_1}\left(\omega_0 \sigma^2 \sin \omega_0(t_1 - t_2)\right) = (\omega_0 \sigma)^2 \cos \omega_0(t_1 - t_2).$$

We note that the result is just the original covariance function scaled up by the factor $\omega_0^2$.
This similarity in form happened because the given $K_{XX}(t, s)$ is the covariance function
of a sine wave with random amplitude and phase (cf. Example 9.1-5). Since the phase is
random, the sine and its derivative the cosine are indistinguishable by shape.








White Noise 


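Let the random process under consideration be the Wiener process of Section 9.2. Here
we consider the derivative of this process. For any $\alpha > 0$, the covariance function of
the Wiener process is $K_{XX}(t_1, t_2) = \alpha \min(t_1, t_2)$ and its mean function is $\mu_X = 0$. Let
$W(t) = dX(t)/dt$. Then proceeding as in the above example, we can calculate $\mu_W(t) =
E[dX(t)/dt] = d\mu_X(t)/dt = 0$. For the covariance,

$$K_{WW}(t_1, t_2) = \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2} K_{XX}(t_1, t_2)\right)
= \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2}\, \alpha \min(t_1, t_2)\right)$$
$$= \frac{\partial}{\partial t_1}\left(\frac{\partial}{\partial t_2}
\begin{cases} \alpha t_2, & t_2 < t_1 \\ \alpha t_1, & t_2 \ge t_1 \end{cases}\right)
= \frac{\partial}{\partial t_1}
\begin{cases} \alpha, & t_2 < t_1 \\ 0, & t_2 > t_1 \end{cases}$$
$$= \frac{\partial}{\partial t_1}
\begin{cases} 0, & t_1 < t_2 \\ \alpha, & t_1 > t_2 \end{cases}
= \alpha \frac{\partial}{\partial t_1} u(t_1 - t_2)
= \alpha\, \delta(t_1 - t_2).$$

Thus the covariance function of white noise is the impulse function. Since white noise
always has zero mean, the correlation function too is an impulse. It is common to see

$$R_{WW}(t_1, t_2) = \sigma^2 \delta(t_1 - t_2) = K_{WW}(t_1, t_2), \qquad (9.3\text{-}6)$$

with $\alpha$ replaced by $\sigma^2$, but one should note that the power in this process is $E[|W(t)|^2] =
\sigma^2 \delta(0) = \infty$, not $\sigma^2$. In fact, $\sigma^2$ is a power density for the white noise process.

Note that the sample functions are highly discontinuous and the white noise process is
not separable.†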


9.4 SOME USEFUL CLASSIFICATIONS OF RANDOM PROCESSES 


Here we look at several classes of random processes and pairs of processes. These classifica- 
tions also apply to the random sequences studied earlier. 


Definition 9.4-1 Let $X$ and $Y$ be random processes. They are

(a) Uncorrelated if $R_{XY}(t_1, t_2) = \mu_X(t_1)\mu_Y^*(t_2)$ for all $t_1$ and $t_2$;

(b) Orthogonal if $R_{XY}(t_1, t_2) = 0$ for all $t_1$ and $t_2$;

(c) Independent if for all positive integers $n$, the $n$th-order CDF of $X$ and $Y$ factors,
that is,

$$F_{XY}(x_1, y_1, x_2, y_2, \ldots, x_n, y_n; t_1, \ldots, t_n)
= F_X(x_1, \ldots, x_n; t_1, \ldots, t_n)\, F_Y(y_1, \ldots, y_n; t_1, \ldots, t_n),$$
for all $x_i$, $y_i$ and for all $t_1, \ldots, t_n$. $\blacksquare$


†The idea of separability (cf. Section 9.1) is to make a countable set of points on the $t$-axis (e.g., time-axis)
determine the properties of the process. In effect it says that knowing the pdf over a countable set of
points implies knowing the pdf everywhere. See [9-6].

Note that two random processes are orthogonal if they are uncorrelated and at least one
of their mean functions is zero. Actually, the orthogonality concept is useful only when the
random processes under consideration are zero-mean, in which case it becomes equivalent to
the uncorrelated condition. The orthogonality concept was introduced for random vectors
in Chapter 5. This concept will prove useful for estimating random processes and sequences
in Chapter 11.

A random process may be uncorrelated, orthogonal, or independent of itself at earlier
and/or later times. For example, we may have $R_{XX}(t_1, t_2) = 0$ for all $t_1 \ne t_2$, in which
case we call $X$ an orthogonal random process. Similarly $X(t)$ may be independent of
$\{X(t_1), \ldots, X(t_n)\}$ for all $t \notin \{t_1, \ldots, t_n\}$, for all $t_1, \ldots, t_n$, and for all $n \ge 1$. Then we
say X(t) is an independent random process. Clearly, the sample functions of such processes 
will be quite rough, since arbitrarily small changes in t yield complete independence. 


Stationarity 


We say a random process is stationary when its statistics do not change with the continuous 
parameter, often time. The formal definition is: 


Definition 9.4-2 A random process $X(t)$ is stationary if it has the same $n$th-order
CDF as $X(t + T)$, that is, the two $n$-dimensional functions

$$F_X(x_1, \ldots, x_n; t_1, \ldots, t_n) = F_X(x_1, \ldots, x_n; t_1 + T, \ldots, t_n + T)$$
are identically equal for all $T$, for all positive integers $n$, and for all $t_1, \ldots, t_n$. $\blacksquare$

When the CDF is differentiable, we can equivalently write this in terms of the pdf as
$$f_X(x_1, \ldots, x_n; t_1, \ldots, t_n) = f_X(x_1, \ldots, x_n; t_1 + T, \ldots, t_n + T),$$
and this is the form of the stationarity condition that is most often used. This definition
implies that the mean of a stationary process is a constant. To prove this, note that $f(x; t) =
f(x; t + T)$ for all $T$ implies $f(x; t) = f(x; 0)$ by taking $T = -t$, which in turn implies that
$E[X(t)] = \mu_X(t) = \mu_X(0)$, a constant.

Since the second-order density is also shift invariant, that is,
$$f(x_1, x_2; t_1, t_2) = f(x_1, x_2; t_1 + T, t_2 + T),$$
we have, on choosing $T = -t_2$, that
$$f(x_1, x_2; t_1, t_2) = f(x_1, x_2; t_1 - t_2, 0),$$
which implies $E[X(t_1)X^*(t_2)] = R_{XX}(t_1 - t_2, 0)$. In the stationary case, therefore, the
notation for the correlation function can be simplified to a function of just the shift $\tau \triangleq t_1 - t_2$
between the two sampling instants or parameters. Thus we can define the one-parameter
correlation function

$$R_{XX}(\tau) \triangleq R_{XX}(\tau, 0) = E[X(t + \tau)X^*(t)], \qquad (9.4\text{-}1)$$

which is functionally independent of the parameter $t$. Examples of this sort of correlation
function were seen in Section 9.3.

A weaker form of stationarity exists which does not directly constrain the nth-order 
CDFs, but rather just the first- and second-order moments. This property, which is easier 
to check, is called wide-sense stationarity and will be quite useful in what follows. 


Definition 9.4-3 A random process $X$ is wide-sense stationary (WSS) if $E[X(t)] =
\mu_X$, a constant, and $E[X(t + \tau)X^*(t)] = R_{XX}(\tau)$ for all $-\infty < \tau < +\infty$, independent of the
time parameter $t$. $\blacksquare$


Example 9.4-1
(WSS complex exponential) Let $X(t) \triangleq A \exp(j2\pi f t)$ with $f$ a known real constant and
$A$ a real-valued random variable with mean $E[A] = 0$ and finite average power $E[A^2]$.
Calculating the mean and correlation of $X(t)$, we obtain

$$E[X(t)] = E[A \exp(j2\pi f t)] = E[A] \exp(j2\pi f t) = 0,$$
and
$$E[X(t + \tau)X^*(t)] = E[A \exp(j2\pi f(t + \tau))\, A \exp(-j2\pi f t)] = E[A^2] \exp(j2\pi f \tau) = R_{XX}(\tau).$$

Note that $E[A] = 0$ is a necessary condition for WSS here. Question: Would this work with
a cosine function in place of the complex exponential?





The process in Example 9.4-1, while shown to be wide-sense stationary, is clearly not
stationary. Consider, for example, that $X(0)$ must be pure real while $X(1/(4f))$ must always
be pure imaginary. We thus conclude that the WSS property is considerably weaker than
stationarity.
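The WSS computation of Example 9.4-1 can be reproduced exactly when $A$ takes finitely many values. The sketch below assumes, purely for illustration, that $A = \pm 1$ with equal probability (so $E[A] = 0$ and $E[A^2] = 1$); the helper names are hypothetical. It evaluates the mean and the correlation $E[X(t+\tau)X^*(t)]$ at several values of $t$ to show that neither depends on $t$.

```python
import cmath
import math


def mean_x(t, f, a_values, probs):
    """Exact E[X(t)] for X(t) = A exp(j 2 pi f t) with discrete real A."""
    phase = cmath.exp(2j * math.pi * f * t)
    return sum(p * a * phase for a, p in zip(a_values, probs))


def correlation(t, tau, f, a_values, probs):
    """Exact E[X(t + tau) X*(t)] for the same process."""
    total = 0j
    for a, p in zip(a_values, probs):
        x_late = a * cmath.exp(2j * math.pi * f * (t + tau))
        x_early = a * cmath.exp(2j * math.pi * f * t)
        total += p * x_late * x_early.conjugate()
    return total


if __name__ == "__main__":
    a_values, probs = [-1.0, 1.0], [0.5, 0.5]  # assumed distribution of A
    f, tau = 2.0, 0.13
    for t in (0.0, 0.3, 7.1):
        print(t, mean_x(t, f, a_values, probs),
              correlation(t, tau, f, a_values, probs))
    print("E[A^2] exp(j 2 pi f tau) =", cmath.exp(2j * math.pi * f * tau))
```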

We can generalize this example to have M complex sinusoids and obtain a rudimentary 
frequency domain representation for zero-mean WSS random processes. Consider 


$$X(t) = \sum_{k=1}^{M} A_k \exp(j2\pi f_k t),$$

where the generally complex random variables $A_k$ are uncorrelated with mean zero and
variances $\sigma_k^2$. Then the resulting random process is WSS with mean zero and autocorrelation
(or autocovariance) equal to

$$R_{XX}(\tau) = \sum_{k=1}^{M} \sigma_k^2 \exp(j2\pi f_k \tau). \qquad (9.4\text{-}2)$$

For such random processes $X(t)$, the set of random coefficients $\{A_k\}$ constitutes a frequency
domain representation. From our experience with Fourier analysis of deterministic functions,
we can expect that as $M$ becomes large and the $f_k$ become dense, that is, the spacing
between the $f_k$ becomes small and they cover the frequency range of interest, most random
processes would have such an approximate representation. Such is the case (cf. Section 10.6).

9.5 WIDE-SENSE STATIONARY PROCESSES AND LSI SYSTEMS

In this section we treat random processes that are jointly stationary and of second order,
that is,

$$E[|X(t)|^2] < \infty.$$
Some important properties of the auto- and cross-correlation functions of stationary second- 
order processes are summarized as follows. They, of course, also hold for the respective 
covariance functions. 


(1) $|R_{XX}(\tau)| \le R_{XX}(0)$, which, for the real case, directly follows from
$E[|X(t + \tau) - X(t)|^2] \ge 0$.

(2) $|R_{XY}(\tau)| \le \sqrt{R_{XX}(0)R_{YY}(0)}$, which is derived using the Schwarz inequality (cf.
Section 4.3; also called diagonal dominance). It also proves the complex case of (1).

(3) $R_{XX}(\tau) = R_{XX}^*(-\tau)$, since $E[X(t + \tau)X^*(t)] = E[X(t)X^*(t - \tau)] =
E^*[X(t - \tau)X^*(t)]$ for WSS random processes, which is called the conjugate symmetry
property. In the special case of a real-valued process, this property becomes that of
even symmetry, that is,

(3a) $R_{XX}(\tau) = R_{XX}(-\tau)$.

Another important property of the autocorrelation function of a complex-valued,
stationary random process is that it must be positive semidefinite, that is,

(4) for all $N > 0$, all $t_1 < t_2 < \ldots < t_N$, and all complex $a_1, a_2, \ldots, a_N$,

$$\sum_{k=1}^{N} \sum_{l=1}^{N} a_k a_l^* R_{XX}(t_k - t_l) \ge 0.$$

This was shown in Section 9.1 to be a necessary condition for a given function
$g(t, s) = g(t - s)$ to be an autocorrelation function. We will show that this property
is also a sufficient condition, so that positive semidefiniteness actually characterizes
autocorrelation functions. In general, however, it is very difficult to check
property (4) directly.


To start off, we can specialize the results of Theorems 9.3-1 and 9.3-2, which were derived 
for the general case, to LSI systems. Rewriting Equation 9.3-2 we have 


$$E[Y(t)] = L\{\mu_X(t)\} = \int_{-\infty}^{+\infty} \mu_X(\tau) h(t - \tau)\, d\tau = \mu_X(t) * h(t).$$

Using Theorem 9.3-2 and Equations 9.3-3 and 9.3-4, we get also

$$R_{XY}(t_1, t_2) = \int_{-\infty}^{+\infty} h^*(\tau_2) R_{XX}(t_1, t_2 - \tau_2)\, d\tau_2,$$
and
$$R_{YY}(t_1, t_2) = \int_{-\infty}^{+\infty} h(\tau_1) R_{XY}(t_1 - \tau_1, t_2)\, d\tau_1,$$

which can be written in convolution operator notation as
$$R_{XY}(t_1, t_2) = h^*(t_2) * R_{XX}(t_1, t_2),$$
where the convolution is along the $t_2$-axis, and
$$R_{YY}(t_1, t_2) = h(t_1) * R_{XY}(t_1, t_2),$$
where the convolution is along the $t_1$-axis. Combining these two equations, we get
$$R_{YY}(t_1, t_2) = h(t_1) * R_{XX}(t_1, t_2) * h^*(t_2).$$

Wide-Sense Stationary Case

If we input the stationary random process $X(t)$ to an LSI system with impulse response
$h(t)$, then the output random process can be expressed as the convolution integral,

$$Y(t) = \int_{-\infty}^{+\infty} h(\tau) X(t - \tau)\, d\tau, \qquad (9.5\text{-}1)$$
when this integral exists. Computing the mean of the output process $Y(t)$, we get
$$E[Y(t)] = \int_{-\infty}^{+\infty} h(\tau) E[X(t - \tau)]\, d\tau \quad \text{by Theorem 9.3-1,}$$
$$= \int_{-\infty}^{+\infty} h(\tau) \mu_X\, d\tau = \mu_X \int_{-\infty}^{+\infty} h(\tau)\, d\tau = \mu_X H(0), \qquad (9.5\text{-}2)$$

where $H(\omega)$ is the system’s frequency response.

We thus see that the mean of the output is constant and equals the mean of the input
times the system function evaluated at $\omega = 0$, the so-called “dc gain” of the system. If we
compute the cross-correlation function between the input process and the output process,
we find that

$$R_{YX}(\tau) = E[Y(t + \tau)X^*(t)]$$
$$= E[Y(t)X^*(t - \tau)] \quad \text{by substituting } t - \tau \text{ for } t,$$
$$= \int_{-\infty}^{+\infty} h(\alpha) E[X(t - \alpha)X^*(t - \tau)]\, d\alpha,$$

and bringing the operator $E$ inside the integral by Theorem 9.3-2,

$$= \int_{-\infty}^{+\infty} h(\alpha) R_{XX}(\tau - \alpha)\, d\alpha,$$
which can be rewritten as
$$R_{YX}(\tau) = h(\tau) * R_{XX}(\tau). \qquad (9.5\text{-}3)$$


Thus, the cross-correlation $R_{YX}$ equals $h$ convolved with the autocorrelation $R_{XX}$. This
fact can be used to identify unknown systems (see Problem 9.28).

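The output autocorrelation function $R_{YY}(\tau)$ can now be obtained from $R_{YX}(\tau)$ as
follows:

$$R_{YY}(\tau) = E[Y(t + \tau)Y^*(t)]$$
$$= E[Y(t)Y^*(t - \tau)] \quad \text{by substituting } t - \tau \text{ for } t,$$
$$= \int_{-\infty}^{+\infty} h^*(\alpha) E[Y(t)X^*(t - \tau - \alpha)]\, d\alpha$$
$$= \int_{-\infty}^{+\infty} h^*(\alpha) E[Y(t)X^*(t - (\tau + \alpha))]\, d\alpha$$
$$= \int_{-\infty}^{+\infty} h^*(\alpha) R_{YX}(\tau + \alpha)\, d\alpha$$
$$= \int_{-\infty}^{+\infty} h^*(-\alpha) R_{YX}(\tau - \alpha)\, d\alpha$$
$$= h^*(-\tau) * R_{YX}(\tau).$$
Combining both equations, we get
$$R_{YY}(\tau) = h(\tau) * h^*(-\tau) * R_{XX}(\tau). \qquad (9.5\text{-}4)$$

We observe that when $R_{XX}(\tau) = \delta(\tau)$, then the output correlation function is
$R_{YY}(\tau) = h(\tau) * h^*(-\tau)$, which is sometimes called the autocorrelation impulse response
(AIR), denoted as $g(\tau) = h(\tau) * h^*(-\tau)$. Note that $g(\tau)$ must be positive semidefinite, and
indeed $FT\{g(\tau)\} = |H(\omega)|^2 \ge 0$.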

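Similarly, we also find (proof left as an exercise for the reader)

$$R_{XY}(\tau) = \int_{-\infty}^{+\infty} h^*(-\alpha) R_{XX}(\tau - \alpha)\, d\alpha = h^*(-\tau) * R_{XX}(\tau), \qquad (9.5\text{-}5a)$$
and
$$R_{YY}(\tau) = \int_{-\infty}^{+\infty} h(\alpha) R_{XY}(\tau - \alpha)\, d\alpha$$
$$= h(\tau) * R_{XY}(\tau)$$
$$= h(\tau) * h^*(-\tau) * R_{XX}(\tau)$$
$$= g(\tau) * R_{XX}(\tau).$$

This elegant and concise notation is shorthand for

$$R_{YY}(\tau) = \int_{-\infty}^{+\infty} g(\tau') R_{XX}(\tau - \tau')\, d\tau' \quad \text{(a convolution)} \qquad (9.5\text{-}5b)$$
$$g(\tau') = \int_{-\infty}^{+\infty} h^*(\alpha) h(\alpha + \tau')\, d\alpha. \quad \text{(a correlation product)} \qquad (9.5\text{-}5c)$$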
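As a numerical illustration of the correlation product in Equation 9.5-5c, the sketch below evaluates $g(\tau)$ for an assumed first-order lowpass impulse response $h(t) = e^{-t}u(t)$, which is not taken from the text. For this choice the integral can be done by hand and gives $g(\tau) = \tfrac{1}{2}e^{-|\tau|}$, so the midpoint-rule result can be compared with the closed form.

```python
import math


def h(t):
    """Assumed impulse response h(t) = exp(-t) u(t) (real valued)."""
    return math.exp(-t) if t >= 0 else 0.0


def air(tau, upper=30.0, n=100_000):
    """g(tau) = integral of h*(a) h(a + tau) da, by the midpoint rule.

    h(a) vanishes for a < 0, so the integration runs over [0, upper].
    """
    da = upper / n
    total = 0.0
    for k in range(n):
        a = (k + 0.5) * da
        total += h(a) * h(a + tau)
    return total * da


def air_closed_form(tau):
    """Hand-computed result for this particular h: exp(-|tau|) / 2."""
    return 0.5 * math.exp(-abs(tau))


if __name__ == "__main__":
    for tau in (-1.0, 0.0, 1.0):
        print(tau, air(tau), air_closed_form(tau))
```

The even symmetry of the result reflects the conjugate symmetry that any autocorrelation impulse response of a real $h$ must have.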


Example 9.5-1
(derivative of WSS process) Let the second-order random process $X(t)$ be stationary with
one-parameter correlation function $R_{XX}(\tau)$ and constant mean function $\mu_X(t) = \mu_X$. Consider
the system consisting of a derivative operator, that is,

$$Y(t) = \frac{dX(t)}{dt}.$$

Using the above equations, we find $\mu_Y(t) = d\mu_X(t)/dt = 0$ and cross-correlation
function

$$R_{XY}(\tau) = u_1(-\tau) * R_{XX}(\tau) = -\frac{dR_{XX}(\tau)}{d\tau},$$
since the impulse response of the derivative operator is $h(t) = d\delta(t)/dt = u_1(t)$, the (formal)
derivative of the impulse $\delta(t)$, sometimes called the unit doublet.† Then
$$R_{YY}(\tau) = u_1(\tau) * R_{XY}(\tau) = \frac{dR_{XY}(\tau)}{d\tau} = -\frac{d^2 R_{XX}(\tau)}{d\tau^2}.$$

Notice the AIR function here is $g(\tau) = -u_2(\tau)$, minus the second (formal) derivative of
$\delta(\tau)$.





Power Spectral Density 


For WSS, and hence for stationary processes, we can define a useful density for average 
power versus frequency, called the power spectral density (psd). 


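Definition 9.5-1 Let $R_{XX}(\tau)$ be an autocorrelation function. Then we define the
power spectral density $S_{XX}(\omega)$ to be its Fourier transform (if it exists), that is,
$$S_{XX}(\omega) \triangleq \int_{-\infty}^{+\infty} R_{XX}(\tau) e^{-j\omega\tau}\, d\tau. \qquad \blacksquare \qquad (9.5\text{-}6)$$
Under quite general conditions one can define the inverse Fourier transform, which
equals $R_{XX}(\tau)$ at all points of continuity,

$$R_{XX}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} S_{XX}(\omega) e^{+j\omega\tau}\, d\omega. \qquad (9.5\text{-}7)$$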


†In this $u$-function notation, $u_{-1}(t) = u(t)$, the unit step function, and $u_0(t) = \delta(t)$, the unit impulse
[9-9].

Table 9.5-1 Correlation Function Properties of Corresponding Power Spectral Densities

Random Process                                   Correlation Function                             Power Spectral Density
$X(t)$                                           $R_{XX}(\tau)$                                   $S_{XX}(\omega)$
$aX(t)$                                          $|a|^2 R_{XX}(\tau)$                             $|a|^2 S_{XX}(\omega)$
$X_1(t) + X_2(t)$, $X_1$ and $X_2$ orthogonal    $R_{X_1X_1}(\tau) + R_{X_2X_2}(\tau)$            $S_{X_1X_1}(\omega) + S_{X_2X_2}(\omega)$
$X'(t)$                                          $-d^2 R_{XX}(\tau)/d\tau^2$                      $\omega^2 S_{XX}(\omega)$
$X^{(n)}(t)$                                     $(-1)^n d^{2n} R_{XX}(\tau)/d\tau^{2n}$          $\omega^{2n} S_{XX}(\omega)$
$X(t)\exp(j\omega_0 t)$                          $\exp(j\omega_0\tau) R_{XX}(\tau)$               $S_{XX}(\omega - \omega_0)$
$X(t)\cos(\omega_0 t + \Theta)$, with            $\frac{1}{2} R_{XX}(\tau)\cos(\omega_0\tau)$     $\frac{1}{4}[S_{XX}(\omega + \omega_0) + S_{XX}(\omega - \omega_0)]$
  independent $\Theta$ uniform on $[-\pi, +\pi]$
$X(t) + b$ ($E[X(t)] = 0$)                       $R_{XX}(\tau) + |b|^2$                           $S_{XX}(\omega) + 2\pi|b|^2\delta(\omega)$


In operator notation we have
$$S_{XX} = FT\{R_{XX}\}$$
and
$$R_{XX} = IFT\{S_{XX}\},$$

where FT and IFT stand for the respective Fourier operators.

The name power spectral density (psd) will be justified later. All that we have done
thus far is define it as the Fourier transform of $R_{XX}(\tau)$. We can also define the Fourier
transform of the cross-correlation function $R_{XY}(\tau)$ to obtain a frequency function called
the cross-power spectral density,

$$S_{XY}(\omega) \triangleq \int_{-\infty}^{+\infty} R_{XY}(\tau) e^{-j\omega\tau}\, d\tau. \qquad (9.5\text{-}8)$$

We will see later that the psd $S_{XX}(\omega)$ is real and everywhere nonnegative and in fact,
as the name implies, has the interpretation of a density function for average power versus
frequency. By contrast, the cross-power spectral density has no such interpretation and is
generally complex valued.


We next list some properties of the psd $S_{XX}(\omega)$:

1. $S_{XX}(\omega)$ is real valued since $R_{XX}(\tau)$ is conjugate symmetric.

2. If $X(t)$ is a real-valued WSS process, then $S_{XX}(\omega)$ is an even function since $R_{XX}(\tau)$
is real and even. Otherwise $S_{XX}(\omega)$ may not be an even function of $\omega$.

3. $S_{XX}(\omega) \ge 0$ (to be shown in Theorem 9.5-1).


Additional properties of the psd are shown in Table 9.5-1. One could go on to expand 
this table, but it will suit our purposes to stop at this point. One comment is in order: We 
note the simplicity of these operations in the frequency domain. This suggests that for LSI
systems and stationary or WSS random processes, we should solve for output correlation
functions by first transforming the input correlation function into the frequency domain,
carrying out the indicated operations, and then transforming back to the correlation domain.
This is completely analogous to the situation in deterministic linear system theory for 
shift-invariant systems. 

Another comment would be that if the interpretation of $S_{XX}(\omega)$ as a density of average
power is correct, then the constant or mean component has all its average power concentrated
at $\omega = 0$ by the last entry in the table. Also by the next-to-last two entries in
the table, modulation by the frequency $\omega_0$ shifts the distribution of average power up in
frequency by $\omega_0$. Both of these results should be quite intuitive.


Example 9.5-2
(power spectral density of white noise) The correlation function of a white noise process
$W(t)$ with parameter $\sigma^2$ is given by $R_{WW}(\tau) = \sigma^2\delta(\tau)$. Hence the power spectral density
(psd), its Fourier transform, is just

$$S_{WW}(\omega) = \sigma^2, \quad -\infty < \omega < +\infty.$$

The psd is thus flat, and hence the name, white noise, by analogy to white light, which
contains equal power at every wavelength. Just like white light, white noise is an idealization
that cannot physically occur, since as we have seen earlier $R_{WW}(0) = \infty$, necessitating
infinite power. Again, we note that the parameter $\sigma^2$ must be interpreted as a power density
in the case of white noise.





An Interpretation of the psd

Given a WSS process $X(t)$, consider the finite support segment,
$$X_T(t) \triangleq X(t)\, I_{[-T,+T]}(t),$$

where $I_{[-T,+T]}$ is an indicator function equal to 1 if $-T \le t \le +T$ and equal to 0 otherwise,
and $T > 0$. We can compute the Fourier transform of $X_T$ by the integral

$$FT\{X_T(t)\} = \int_{-T}^{+T} X(t) e^{-j\omega t}\, dt.$$
The magnitude squared of this random variable is
$$|FT\{X_T(t)\}|^2 = \int_{-T}^{+T} \int_{-T}^{+T} X(t_1) X^*(t_2) e^{-j\omega(t_1 - t_2)}\, dt_1\, dt_2.$$
Dividing by $2T$ and taking the expectation, we get
$$\frac{1}{2T} E\left[|FT\{X_T(t)\}|^2\right] = \frac{1}{2T} \int_{-T}^{+T} \int_{-T}^{+T} R_{XX}(t_1 - t_2) e^{-j\omega(t_1 - t_2)}\, dt_1\, dt_2. \qquad (9.5\text{-}9a)$$

To evaluate the double integral on the right, introduce the new coordinate system $s =
t_1 + t_2$, $\tau = t_1 - t_2$. The relationship between the $(s, \tau)$ and $(t_1, t_2)$ coordinate systems
is shown in Figure 9.5-1a. The Jacobian (scale-change) of this transformation is 1/2 and
the region of integration is the diamond-shaped surface $\wp$ shown in Figure 9.5-1b, which is
Figure 9.5-1a rotated counterclockwise 45° and whose sides have length $2\sqrt{2}\,T$. The double
integral in Equation 9.5-9a then becomes

$$\frac{1}{4T} \iint_{\wp} R_{XX}(\tau) e^{-j\omega\tau}\, d\tau\, ds
= \frac{1}{4T} \int_{-2T}^{0} R_{XX}(\tau) e^{-j\omega\tau} \left[\int_{-(2T+\tau)}^{2T+\tau} ds\right] d\tau
+ \frac{1}{4T} \int_{0}^{2T} R_{XX}(\tau) e^{-j\omega\tau} \left[\int_{-(2T-\tau)}^{2T-\tau} ds\right] d\tau$$
$$= \int_{-2T}^{+2T} \left[1 - \frac{|\tau|}{2T}\right] R_{XX}(\tau) e^{-j\omega\tau}\, d\tau.$$

[Figure 9.5-1: (a) Square region in the $(t_1, t_2)$ plane; (b) integration in the diamond-shaped region created by the transformation $s = t_1 + t_2$, $\tau = t_1 - t_2$.]

In the limit as $T \to +\infty$, this integral tends to Equation 9.5-6 for an integrable $R_{XX}$;
thus

$$S_{XX}(\omega) = \lim_{T \to \infty} \frac{1}{2T} E\left[|FT\{X_T(t)\}|^2\right], \qquad (9.5\text{-}9b)$$

so that $S_{XX}(\omega)$ is real and nonnegative and is related to average power at frequency $\omega$.
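The convergence of the triangularly weighted integral to the psd can be observed numerically. The sketch below assumes the exponential correlation $R_{XX}(\tau) = \exp(-\alpha|\tau|)$ with $\alpha = 3$, whose Fourier transform is the standard pair $2\alpha/(\alpha^2 + \omega^2)$, and evaluates the integral over $[-2T, +2T]$ with the weight $1 - |\tau|/(2T)$ by the midpoint rule. The function name and the choice of $R_{XX}$ are illustrative assumptions.

```python
import cmath
import math


def windowed_psd(omega, T, alpha=3.0, n=20_000):
    """Midpoint-rule value of
    integral_{-2T}^{2T} (1 - |tau|/(2T)) exp(-alpha |tau|) exp(-j omega tau) dtau.
    """
    d = 4.0 * T / n
    total = 0j
    for k in range(n):
        tau = -2.0 * T + (k + 0.5) * d
        weight = 1.0 - abs(tau) / (2.0 * T)
        total += weight * math.exp(-alpha * abs(tau)) * cmath.exp(-1j * omega * tau)
    # The imaginary part cancels by symmetry; keep the real part.
    return (total * d).real


def limiting_psd(omega, alpha=3.0):
    """Fourier transform of exp(-alpha |tau|): 2 alpha / (alpha^2 + omega^2)."""
    return 2.0 * alpha / (alpha ** 2 + omega ** 2)


if __name__ == "__main__":
    for T in (1.0, 5.0, 50.0):
        print(T, windowed_psd(0.0, T), limiting_psd(0.0))
```

The gap between the two columns shrinks roughly like $1/T$, since the weight differs from 1 by $|\tau|/(2T)$.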
We next look at two examples of the computation of psd’s corresponding to correlation 
functions we have seen earlier. 


Example 9.5-3
Find the power spectral density for the following exponential autocorrelation function with
parameter $\alpha > 0$:

$$R_{XX}(\tau) = \exp(-\alpha|\tau|), \quad -\infty < \tau < +\infty.$$

This is the autocorrelation function of the random telegraph signal (RTS) discussed in
Section 9.2. Its psd is computed as

$$S_{XX}(\omega) = \int_{-\infty}^{+\infty} R_{XX}(\tau) e^{-j\omega\tau}\, d\tau = \int_{-\infty}^{+\infty} e^{-\alpha|\tau|} e^{-j\omega\tau}\, d\tau$$
$$= \int_{-\infty}^{0} e^{(\alpha - j\omega)\tau}\, d\tau + \int_{0}^{\infty} e^{-(\alpha + j\omega)\tau}\, d\tau$$
$$= 2\alpha/[\alpha^2 + \omega^2], \quad -\infty < \omega < +\infty.$$

This function is plotted in Figure 9.5-2 for $\alpha = 3$. We see that the peak value is at the origin
and equal to $2/\alpha$. The “bandwidth” of the process is seen to be $\alpha$ on a 3 dB basis (if $S_{XX}$
is indeed a power density, to be shown). We note that while there is a cusp at the origin of
the correlation function $R_{XX}$, there is no cusp in its spectral density $S_{XX}$. In fact $S_{XX}$ is
continuous and differentiable everywhere. (It is true that $S_{XX}$ will always be continuous if
$R_{XX}$ is absolutely integrable.)

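Figure 9.5-2 was created using MATLAB with the short m-file:

clear
alpha = 3;                  % correlation decay parameter
b = [1.0 0.0 alpha^2];      % denominator polynomial w^2 + alpha^2
w = linspace(-10,+10);      % frequency grid
den = polyval(b,w);         % evaluate the denominator on the grid
num = 2*alpha;              % numerator 2*alpha
S = num./den;               % psd 2*alpha/(alpha^2 + w^2)
plot(w,S)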


We note that the psd decays rather slowly, and thus the RTS process requires a significant
amount of bandwidth. The tails of the psd are so long because of the jumps
in the RTS sample functions.
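The peak value, the 3 dB bandwidth, and the slow decay of this psd can be quantified with a few lines of code. The sketch below uses the closed form $S_{XX}(\omega) = 2\alpha/(\alpha^2 + \omega^2)$ together with the elementary integral $\frac{1}{2\pi}\int_{-W}^{+W} S_{XX}(\omega)\, d\omega = \frac{2}{\pi}\arctan(W/\alpha)$, which is the fraction of the total power $R_{XX}(0) = 1$ in the band $|\omega| \le W$ (anticipating the power-density interpretation shown later). The function names are illustrative.

```python
import math


def psd(omega, alpha):
    """S_XX(omega) = 2 alpha / (alpha^2 + omega^2)."""
    return 2.0 * alpha / (alpha ** 2 + omega ** 2)


def power_fraction(W, alpha):
    """Fraction of total power in |omega| <= W: (2/pi) * atan(W / alpha)."""
    return (2.0 / math.pi) * math.atan(W / alpha)


def bandwidth_for_fraction(p, alpha):
    """Smallest W such that the band |omega| <= W holds a fraction p of the power."""
    return alpha * math.tan(p * math.pi / 2.0)


if __name__ == "__main__":
    alpha = 3.0
    print("peak value:", psd(0.0, alpha))                # equals 2/alpha
    print("value at omega = alpha:", psd(alpha, alpha))  # half the peak (3 dB point)
    print("power within 3 dB band:", power_fraction(alpha, alpha))
    print("W for 99% of power:", bandwidth_for_fraction(0.99, alpha))
```

Only half of the power lies within the 3 dB bandwidth, and capturing 99% of it takes a band many times wider than $\alpha$, which is one way to see the long tails.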
[Figure 9.5-2: Plot of psd for exponential autocorrelation function.]

Example 9.5-4
(psd of triangular autocorrelation) Consider an autocorrelation function that is triangular
in shape such that the correlation goes to zero at shift $T > 0$,

$$R_{XX}(\tau) = \max\left[1 - \frac{|\tau|}{T},\, 0\right].$$

One way this could arise is the asynchronous binary signaling (ABS) process introduced in
Section 9.2. This function is plotted as Figure 9.5-3. If we realize that this triangle can be
written as the convolution of two rectangular pulses, each of width $T$ and height $1/\sqrt{T}$,
then we can use the convolution theorem of the Fourier transform [9-3,-4] to see that the
psd of the triangular correlation function is just the square of the Fourier transform of the
rectangular pulse, that is, the sinc function. The transform of the rectangular pulse is

$$\sqrt{T}\, \frac{\sin(\omega T/2)}{(\omega T/2)},$$

and the power spectral density $S_{XX}$ of the triangular correlation function is thus

$$S_{XX}(\omega) = T \left(\frac{\sin(\omega T/2)}{\omega T/2}\right)^2. \qquad (9.5\text{-}10)$$

As a check we note that $S_{XX}(0)$ is just the area under the correlation function, which in the
triangular case is easily seen to be $T$. Thus checking,

$$S_{XX}(0) = T \lim_{\omega \to 0} \left(\frac{\sin(\omega T/2)}{\omega T/2}\right)^2 = T.$$
[Figure 9.5-3: A triangular autocorrelation function.]

Another way the triangular correlation function can arise is the running integral average
operating on white noise. Consider

$$X(t) \triangleq \frac{1}{\sqrt{T}} \int_{t-T}^{t} W(s)\, ds,$$

with $W(t)$ a white noise with zero mean and correlation function $R_{WW}(\tau) = \delta(\tau)$. Then
$\mu_X(t) = 0$ and $E[X(t_1)X(t_2)]$ can be computed as

$$R_{XX}(t_1, t_2) = \frac{1}{T} \int_{t_1 - T}^{t_1} \int_{t_2 - T}^{t_2} R_{WW}(s_1 - s_2)\, ds_1\, ds_2
= \frac{1}{T} \int_{t_1 - T}^{t_1} \left[\int_{t_2 - T}^{t_2} \delta(s_2 - s_1)\, ds_2\right] ds_1.$$

Now defining the inner integral as

$$g_{t_2}(s_1) \triangleq \int_{t_2 - T}^{t_2} \delta(s_2 - s_1)\, ds_2 =
\begin{cases} 1, & t_2 - T \le s_1 \le t_2, \\ 0, & \text{else}, \end{cases}$$

which as a function of $s_1$ looks as shown in Figure 9.5-4, so

$$R_{XX}(t_1, t_2) = \frac{1}{T} \int_{t_1 - T}^{t_1} g_{t_2}(s_1)\, ds_1 = \max\left[1 - \frac{|t_1 - t_2|}{T},\, 0\right].$$

[Figure 9.5-4: Plot of $g_{t_2}(s_1)$ versus $s_1$ for $t_2 > T$.]
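The transform pair of Example 9.5-4 can be checked by direct numerical integration of the triangular correlation function. The sketch below compares a midpoint-rule Fourier integral of $\max[1 - |\tau|/T, 0]$ with the closed form $T\,(\sin(\omega T/2)/(\omega T/2))^2$; the value $T = 2$ and the function names are illustrative assumptions.

```python
import math


def tri_psd_closed(omega, T):
    """Closed form T * (sin(omega T / 2) / (omega T / 2))^2, with the limit T at omega = 0."""
    x = omega * T / 2.0
    if x == 0.0:
        return T
    return T * (math.sin(x) / x) ** 2


def tri_psd_numeric(omega, T, n=20_000):
    """Midpoint-rule Fourier integral of max(1 - |tau|/T, 0) over [-T, T].

    The correlation is real and even, so only the cosine part contributes.
    """
    d = 2.0 * T / n
    total = 0.0
    for k in range(n):
        tau = -T + (k + 0.5) * d
        total += (1.0 - abs(tau) / T) * math.cos(omega * tau)
    return total * d


if __name__ == "__main__":
    T = 2.0
    for omega in (0.0, 1.3, math.pi, 5.0):
        print(omega, tri_psd_numeric(omega, T), tri_psd_closed(omega, T))
```

The zeros of the psd fall at multiples of $2\pi/T$, as the $\omega = \pi$ row shows for $T = 2$.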


More on White Noise

The correlation function of white noise is an impulse (Equation 9.3-6), so its psd is a constant
$$S_{WW}(\omega) = \sigma^2, \quad -\infty < \omega < +\infty.$$

The name white noise thus arises out of the fact that the power spectral density is constant
at all frequencies just as in white light, which contains all wavelengths in equal amounts.†
Here we look at the white noise process as a limit approached by a sequence of second-order
processes. To this end consider an independent increment process (cf. Definition 9.2-1) with
zero mean such as the Wiener process ($R_{XX}(t_1, t_2) = \sigma^2 \min(t_1, t_2)$) or a centered Poisson
process, that is, $N_c(t) = N(t) - \lambda t$, with correlation $R_{N_cN_c}(t_1, t_2) = \lambda \min(t_1, t_2)$. Actually
we need only uncorrelated increments here; thus we require $X(t)$ only to have uncorrelated
increments. For such processes we have by Equation 9.2-17,

$$E\left[(X(t + \Delta) - X(t))^2\right] = \alpha\Delta,$$

where $\alpha$ is the variance parameter.
Thus upon letting $X_\Delta(t)$ denote the first-order difference divided by $\Delta$,

$$X_\Delta(t) \triangleq [X(t + \Delta) - X(t)]/\Delta,$$
we have
$$E[X_\Delta^2(t)] = \alpha/\Delta$$
and
$$E[X_\Delta(t_1)X_\Delta(t_2)] = 0 \quad \text{for } |t_2 - t_1| > \Delta.$$

If we consider $|t_2 - t_1| < \Delta$, we can do the following calculation, which shows that the
resulting correlation function is triangular, just as in Example 9.5-4. Since $X(t_1 + \Delta) - X(t_1)$
is distributed as $N(0, \alpha\Delta)$, taking $t_1 < t_2$ and shifting $t_1$ to 0 and $t_2$ to $t_2 - t_1$, the expectation
becomes

$$\frac{1}{\Delta^2} E\left[X(\Delta)\left(X(t_2 - t_1 + \Delta) - X(t_2 - t_1)\right)\right]$$
$$= \frac{1}{\Delta^2} E\left[X(\Delta)\left(X(\Delta) - X(t_2 - t_1)\right)\right] \quad \text{since } (\Delta, t_2 - t_1 + \Delta] \cap (0, \Delta] = \emptyset,$$
$$= \frac{1}{\Delta^2}\left[\alpha\Delta - \alpha(t_2 - t_1)\right] = \frac{\alpha}{\Delta}\left[1 - (t_2 - t_1)/\Delta\right].$$

Thus, the process generated by the first-order difference is WSS (the mean is zero) and has
correlation function $R_{\Delta\Delta}(\tau)$ given as

$$R_{\Delta\Delta}(\tau) = \frac{\alpha}{\Delta} \max\left[1 - \frac{|\tau|}{\Delta},\, 0\right].$$

†A mathematical idealization! Physics tells us that, for realistic models, the power density must tend
toward zero as $\omega \to \infty$.
We note from Figure 9.5-5 that as $\Delta$ goes to zero this correlation function tends to a delta
function.

[Figure 9.5-5: Correlation function of $X_\Delta(t)$.]

Since we just computed the Fourier transform of a triangular function in Example 9.5-4,
we can write the psd by inspection as

$$S_{\Delta\Delta}(\omega) = \alpha \left(\frac{\sin(\omega\Delta/2)}{\omega\Delta/2}\right)^2.$$

This psd is approximately flat out to $|\omega| = \pi/(3\Delta)$. As $\Delta \to 0$, $S_{\Delta\Delta}(\omega)$ approaches the
constant $\alpha$ everywhere. Thus as $\Delta \to 0$, $X_\Delta(t)$ “converges” to white noise, the formal
derivative of an uncorrelated increments process,

$$R_{\dot{X}\dot{X}}(t_1, t_2) = \frac{\partial^2}{\partial t_1 \partial t_2}\left[\sigma^2 \min(t_1, t_2)\right]
= \frac{\partial}{\partial t_1}\left[\sigma^2 u(t_1 - t_2)\right]
= \sigma^2 \delta(t_1 - t_2).$$


If one has a system that is continuous in its response to stimuli, then we say that the system
is continuous; that is, the system operator is a continuous operator. This would mean, for
example, that the output would change only slightly if the input changed slightly. A stable
differential or difference equation is an example of such a continuous operator. We will see
that for linear shift-invariant systems that are described by system functions, the response
to the random process $X_\Delta(t)$ will change only slightly when $\Delta$ changes, if $\Delta$ is small and if
the systems are lowpass in the sense that the system function tends to zero as $|\omega| \to +\infty$.
Thus the white noise can be seen as a convenient artifice for more easily constructing this
limiting output. (See Problem 9.36.)
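The flatness of the psd of the first-order difference process can be made concrete. The sketch below evaluates $S_{\Delta\Delta}(\omega) = \alpha\,(\sin(\omega\Delta/2)/(\omega\Delta/2))^2$; at the band edge $|\omega| = \pi/(3\Delta)$ the argument is $\pi/6$, so the ratio to $\alpha$ is $(3/\pi)^2$ for every $\Delta$, while at any fixed $\omega$ the psd approaches $\alpha$ as $\Delta \to 0$. The value of $\alpha$ and the function name are illustrative assumptions.

```python
import math


def s_delta(omega, alpha, delta):
    """psd of X_Delta(t): alpha * (sin(omega delta / 2) / (omega delta / 2))^2."""
    x = omega * delta / 2.0
    if x == 0.0:
        return alpha
    return alpha * (math.sin(x) / x) ** 2


if __name__ == "__main__":
    alpha = 1.7  # assumed variance parameter
    for delta in (1.0, 0.1, 0.001):
        edge = math.pi / (3.0 * delta)
        print(delta, s_delta(edge, alpha, delta) / alpha, s_delta(5.0, alpha, delta))
```

The first printed ratio stays the same for every $\Delta$, so the psd is within about 9% of $\alpha$ across a band whose width grows like $1/\Delta$.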
If we take Fourier transforms of both sides of Equation 9.5-3 we obtain the cross-power
spectral density,
$$S_{YX}(\omega) = H(\omega) S_{XX}(\omega). \qquad (9.5\text{-}11)$$

Since $S_{YX}$ is a frequency-domain representation of the cross-correlation function $R_{YX}$,
Equation 9.5-11 tells us that $Y(t)$ and $X(t)$ will have high cross correlation at those frequencies
$\omega$ where the product of $H(\omega)$ and $S_{XX}(\omega)$ is large. Similarly, from Equation 9.5-5, we
can obtain

$$S_{XY}(\omega) = H^*(\omega) S_{XX}(\omega). \qquad (9.5\text{-}12)$$

From the fundamental Equation 9.5-4, repeated here for convenience,

$$R_{YY}(\tau) = h(\tau) * R_{XX}(\tau) * h^*(-\tau), \qquad (9.5\text{-}13)$$
we get, upon Fourier transformation, in the spectral density domain,
$$S_{YY}(\omega) = |H(\omega)|^2 S_{XX}(\omega) = G(\omega) S_{XX}(\omega). \qquad (9.5\text{-}14)$$

These two equations are among the most important in the theory of stationary random
processes. In particular, Equation 9.5-14 shows that the average power in the output process
at each frequency is just the average input power at that frequency multiplied by $|H(\omega)|^2$, the
power gain of the LSI system. We can call $G(\omega) = |H(\omega)|^2$ the psd transfer function.


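Example 9.5-5
(average power) The transfer function of an LSI system is given by

$$H(\omega) = \operatorname{sgn}(\omega) \left(\frac{\omega}{2\pi}\right)^2 \exp\left[-j(\omega - 2)\right] W(\omega),$$

where $\operatorname{sgn}(\cdot)$ is the algebraic sign function, and where the frequency window function is

$$W(\omega) \triangleq \begin{cases} 1, & \text{for } |\omega| \le 40\pi, \\ 0, & \text{else}. \end{cases}$$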

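Let the WSS input random process have autocorrelation function
$$R_{XX}(\tau) = \frac{5}{2}\delta(\tau) + 2.$$

Compute the average measurable power in the band 0.0 to 1.0 Hertz (single-sided). In
radians, this is the double-sided range $-2\pi$ to $2\pi$. First we Fourier transform $R_{XX}(\tau)$ to
obtain $S_{XX}(\omega) = \frac{5}{2} + 4\pi\delta(\omega)$. Next we compute the psd transfer function $G(\omega) = |H(\omega)|^2 =
\left(\frac{\omega}{2\pi}\right)^4 W(\omega)$. The output psd then is

$$S_{YY}(\omega) = \frac{5}{2}\left(\frac{\omega}{2\pi}\right)^4 W(\omega),$$

and the total average output power would be calculated as

$$R_{YY}(0) = \frac{1}{2\pi} \int_{-40\pi}^{+40\pi} \frac{5}{2}\left(\frac{\omega}{2\pi}\right)^4 d\omega,$$

while the power in the band $[-2\pi, +2\pi]$ is
$$P = \frac{1}{2\pi} \int_{-2\pi}^{+2\pi} \frac{5}{2}\left(\frac{\omega}{2\pi}\right)^4 d\omega = 1 \text{ watt}.$$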

The following comment on Equations 9.5-3 through 9.5-14 may help you keep track
of the conjugates and minus signs. Notice that the conjugate and negative argument on
the impulse response, which becomes simply a conjugate in the frequency domain, arises
in connection with the second factor in the correlation. The $h(\tau)$ without the conjugate or
negative time argument comes from the linear operation implied by the first subscript, that
is, the first factor in the correlation.

With reference to Equation 9.5-11 we see that the cross-spectral density function can 
be complex and hence has no positivity or conjugate symmetry properties, since those 
that Sxx has will be lost upon multiplication with an arbitrary, generally complex H. 
On the other hand, as shown in Equation 9.5-14, the psd of the output will share the 
real and nonnegative aspects of the psd of the input, since multiplication with |H|? will 
not change these properties. Table 9.5-2 sets forth all the above relations for easy 
reference. 

We are now in a position to show that the psd $S(\omega)$ has a precise interpretation as a
density for average power versus frequency. We will show directly that $S(\omega) \ge 0$ for all $\omega$
and that the average power in the frequency band $(\omega_1, \omega_2)$ is given by the integral of $S(\omega)$
over that frequency band, scaled by $1/2\pi$.


Theorem 9.5-1 Let $X(t)$ be a stationary, second-order random process with correlation
function $R_{XX}(\tau)$ and power spectral density $S_{XX}(\omega)$. Then $S_{XX}(\omega) \ge 0$ and
for all $\omega_2 > \omega_1$,

$$\frac{1}{2\pi}\int_{\omega_1}^{\omega_2} S_{XX}(\omega)\,d\omega$$

is the average power in the frequency band $(\omega_1, \omega_2)$.


Proof Let $\omega_2 > \omega_1$ both be real numbers. Define a filter transfer function as follows:

$$H(\omega) \triangleq \begin{cases} 1, & \omega \in (\omega_1, \omega_2) \\ 0, & \text{else,} \end{cases}$$

and note that it passes signals only in the band $(\omega_1, \omega_2)$. If $X(t)$ is input to this filter, the
psd of the output $Y(t)$ is (by Equation 9.5-14)

$$S_{YY}(\omega) = \begin{cases} S_{XX}(\omega), & \omega \in (\omega_1, \omega_2) \\ 0, & \text{else.} \end{cases}$$

Now the output power in $Y(t)$ has average value $E[|Y(t)|^2] = R_{YY}(0)$,

$$R_{YY}(0) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} S_{YY}(\omega)\,d\omega = \frac{1}{2\pi}\int_{\omega_1}^{\omega_2} S_{XX}(\omega)\,d\omega \ge 0,$$

and this holds for all $\omega_2 > \omega_1$. So by choosing $\omega_2 \approx \omega_1$ we can conclude that $S_{XX}(\omega) \ge 0$
for all $\omega$ and that the function $S_{XX}$ thus has the interpretation of a power density in the
sense that if we integrate this function across a frequency band, we get the average power
in that band. ■


Table 9.5-2 Input/Output Relations for Linear Systems with WSS Inputs

WSS random process: $Y(t) = h(t) * X(t)$
Output mean: $\mu_Y = H(0)\mu_X$

Crosscorrelations and cross-power spectral densities:
$R_{XY}(\tau) = R_{XX}(\tau) * h^*(-\tau)$, $\quad S_{XY}(\omega) = S_{XX}(\omega)H^*(\omega)$
$R_{YX}(\tau) = h(\tau) * R_{XX}(\tau)$, $\quad S_{YX}(\omega) = H(\omega)S_{XX}(\omega)$
$R_{YY}(\tau) = R_{YX}(\tau) * h^*(-\tau)$, $\quad S_{YY}(\omega) = S_{YX}(\omega)H^*(\omega)$

Autocorrelation and power spectral density:
$R_{YY}(\tau) = h(\tau) * R_{XX}(\tau) * h^*(-\tau) = g(\tau) * R_{XX}(\tau)$
$S_{YY}(\omega) = |H(\omega)|^2 S_{XX}(\omega) = G(\omega)S_{XX}(\omega)$

Output power and variance:
$E\{|Y(t)|^2\} = R_{YY}(0) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} |H(\omega)|^2 S_{XX}(\omega)\,d\omega$
$\sigma_Y^2 = R_{YY}(0) - |\mu_Y|^2$


We saw earlier that the conditions that a function must meet to be a valid correlation 
or covariance function are rather strong. In fact, we have seen that the function must be 
positive semidefinite, although we have not in fact shown that this condition is sufficient. 
It turns out that one more advantage of working in the frequency domain is the ease with 
which we can specify when a given frequency function qualifies as a power spectral density. 
The function simply must be real and nonnegative, that is, $S(\omega) \ge 0$. We can see this for
a given function $F(\omega) \ge 0$ by taking a filter with transfer function $H(\omega) = \sqrt{F(\omega)}$ and
letting the input be white noise with $S_{WW} = 1$. Then by Equation 9.5-14 the output psd is
$S_{XX}(\omega) = F(\omega)$, thus showing that $F$ is a valid psd. If the random process is real valued,
as it most often is, then we also need $F(\omega)$ to be an even function to satisfy psd property
(2) listed just after Definition 9.5-1. All this can be formalized as follows.


Theorem 9.5-2 Let $F(\omega)$ be an integrable function that is real and nonnegative;
that is, $F(\omega) \ge 0$ for all $\omega$. Then there exists a stationary random process with power
spectral density $S(\omega) = F(\omega)$. If the random process is to be real valued, then $F(\omega)$ must
be an even function of $\omega$. ■


We now see that the test for a valid spectral density function is much easier than the 
condition of positive semidefiniteness for the correlation function. In fact, it is relatively 
easy to show that the positive semidefinite condition on a function is equivalent to the 
nonnegativity of its Fourier transform, and hence that positive semidefiniteness is the suffi- 
cient condition for a function to be a valid correlation or covariance function. First, by 
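Theorem 9.5-2 we know that the positive semidefinite condition is implied by the nonnegativity
of $S(\omega)$. To show equivalence, it remains to show that the positive semidefinite
condition on a function $f(\tau)$ implies that its Fourier transform $F(\omega)$ is nonnegative. We
proceed as follows: Since $f(\tau)$ is positive semidefinite we have,

$$\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m^* f(\tau_n - \tau_m) \ge 0.$$

Also since

$$f(\tau) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} F(\omega)e^{+j\omega\tau}\,d\omega,$$

we have

$$\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m^* \frac{1}{2\pi}\int_{-\infty}^{+\infty} F(\omega)e^{+j\omega(\tau_n - \tau_m)}\,d\omega \ge 0,$$

which can be rewritten as

$$\frac{1}{2\pi}\int_{-\infty}^{+\infty} F(\omega)\left(\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m^* e^{+j\omega(\tau_n - \tau_m)}\right)d\omega = \frac{1}{2\pi}\int_{-\infty}^{+\infty} F(\omega)\left|\sum_{n=1}^{N} a_n e^{+j\omega\tau_n}\right|^2 d\omega \ge 0,$$
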
where we recognize the term inside the magnitude square sign as a so-called transversal
or tapped delay-line filter. Thus by choosing $N$ large enough, with the $\tau_n$ equally spaced,
we can select the $a_n$'s to arbitrarily approximate any ideal filter transfer function $H(\omega)$.
Then by choosing $H$ to be very narrow bandpass filters centered at each value of $\omega$, we can
eventually conclude that $F(\omega) \ge 0$ for all $\omega$, $-\infty < \omega < +\infty$. We have thereby established
the following theorem.


Theorem 9.5-3 A necessary and sufficient condition for $f(\tau)$ to be a correlation
function is that it be positive semidefinite. ■
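The link between positive semidefiniteness and a nonnegative Fourier transform can be illustrated with two simple candidate functions. The sketch below is an assumption-laden illustration, not part of the text's argument: the triangle and rectangle functions, the lag grids, and the coefficient vectors are all chosen here for convenience. The triangle has a squared-sinc transform (nonnegative), while the rectangle has a sinc transform that goes negative.

```python
import random

def triangle(tau):
    """f(tau) = 1 - |tau| for |tau| <= 1; its Fourier transform is a squared sinc, hence >= 0."""
    return max(0.0, 1.0 - abs(tau))

def rect(tau):
    """f(tau) = 1 for |tau| <= 1; its Fourier transform is a sinc, which takes negative values."""
    return 1.0 if abs(tau) <= 1.0 else 0.0

def quadratic_form(f, taus, a):
    """Sum over n, m of a_n * a_m * f(tau_n - tau_m) for real coefficients a_n."""
    size = len(taus)
    return sum(a[n] * a[m] * f(taus[n] - taus[m])
               for n in range(size) for m in range(size))

# A single coefficient vector with a negative quadratic form rules out the rectangle.
rect_value = quadratic_form(rect, [0.0, 1.0, 2.0], [1.0, -1.0, 1.0])

# Random trials cannot prove semidefiniteness, but they should never go negative for the triangle.
rng = random.Random(0)
taus = [0.25 * k for k in range(8)]
triangle_min = min(
    quadratic_form(triangle, taus, [rng.uniform(-1.0, 1.0) for _ in taus])
    for _ in range(2000)
)
print(rect_value, triangle_min)
```

The rectangle therefore cannot be a correlation function, while the triangle passes every trial, as Theorem 9.5-3 predicts.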


Incidentally, there is an analogy here for probability density functions, which can be 
regarded as the Fourier transforms of their CFs. As we know, nonnegativity is the sufficient 
condition for a function to be a valid pdf (assuming that it is normalized to integrate to 
one); thus the probability density is analogous to the power spectral density; and in fact one 
can define a spectral distribution function [9-7] analogous to the cumulative distribution 
function. Thus the CF and the correlation function are also analogous and so both must be 
positive semidefinite to be valid for their respective roles. Also for the CF the normalization 
of the probability density to integrate to one imposes the condition $\Phi(0) = 1$, which is easily
met by scaling an arbitrary positive semidefinite function that is not identically zero. 


Stationary Processes and Differential Equations 


We shall now examine stochastic differential equations with a stationary or at least WSS 
input, and also with the linear constant-coefficient differential equation (LCCDE) valid for 
all time. We assume that the equation is stable in the bounded-input, bounded-output 
(BIBO) sense, so that the resulting output process is also stationary (or WSS if that is the 
condition on the input process). 

Thus consider the following general LCCDE: 


$$a_N Y^{(N)}(t) + a_{N-1}Y^{(N-1)}(t) + \cdots + a_0 Y(t) = b_M X^{(M)}(t) + b_{M-1}X^{(M-1)}(t) + \cdots + b_0 X(t), \quad -\infty < t < +\infty.$$

This represents the relationship between output $Y(t)$ and input $X(t)$ in a linear system
with frequency response

$$H(\omega) = B(\omega)/A(\omega), \quad \text{with } a_0 \ne 0,$$

where

$$B(\omega) \triangleq \sum_{m=0}^{M} b_m (j\omega)^m$$

and

$$A(\omega) \triangleq \sum_{n=0}^{N} a_n (j\omega)^n,$$

which is a rational function with numerator polynomial $B(\omega)$ and denominator polynomial
$A(\omega)$. Because the system is stable, we can apply the results of the previous section to
obtain

$$\mu_Y = \mu_X H(0),$$

$$S_{YX}(\omega) = H(\omega)S_{XX}(\omega),$$

and

$$S_{YY}(\omega) = |H(\omega)|^2 S_{XX}(\omega),$$

where

$$H(0) = b_0/a_0 \quad \text{and} \quad |H(\omega)|^2 = |B(\omega)|^2/|A(\omega)|^2.$$

So

$$\mu_Y = (b_0/a_0)\mu_X \quad \text{and} \quad S_{YY}(\omega) = \left(|B(\omega)|^2/|A(\omega)|^2\right) S_{XX}(\omega).$$
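These relations translate directly into a few lines of code. The following is a minimal sketch, assuming the coefficients are stored as lists `b` and `a` indexed by the power of $j\omega$ (the function names and this storage convention are choices made here, not the text's); the coefficient values are those of the second-order equation treated in Example 9.5-7.

```python
def freq_response(b, a, omega):
    """H(w) = B(w) / A(w), with b[m] and a[n] multiplying (jw)^m and (jw)^n."""
    jw = 1j * omega
    num = sum(bm * jw ** m for m, bm in enumerate(b))
    den = sum(an * jw ** n for n, an in enumerate(a))
    return num / den

def output_psd(b, a, omega, input_psd):
    """S_YY(w) = |H(w)|^2 * S_XX(w) for a WSS input with psd input_psd(w)."""
    return abs(freq_response(b, a, omega)) ** 2 * input_psd(omega)

# Y'' + 3Y' + 2Y = 5X driven by unit-intensity white noise.
b = [5.0]
a = [2.0, 3.0, 1.0]

def white(omega):
    """Flat input psd S_XX(w) = 1."""
    return 1.0

print(freq_response(b, a, 0.0))        # dc gain b0 / a0
print(output_psd(b, a, 1.0, white))    # 25 / (w^4 + 5 w^2 + 4) evaluated at w = 1
```

The output mean then follows as the dc gain times the input mean, without any time-domain calculation.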


This frequency-domain analysis method is generally preferable to the time-domain 
approach but is restricted to the case where both the input and output processes are at least 
WSS. After we obtain the various spectral densities, then we can use the IFT to obtain the 
correlation and covariance functions if they are desired. The calculation of the required IFTs
is often easier if viewed as an inverse two-sided Laplace transform. The Laplace transform
of Equation 9.5-3 is

$$S_{YX}(s) = H(s)S_{XX}(s) \qquad (9.5\text{-}15)$$

while the Laplace transform of Equation 9.5-13 is written

$$S_{YY}(s) = H(s)H(-s)S_{XX}(s) \qquad (9.5\text{-}16)$$

in light of $h^*(-\tau) \leftrightarrow H(-s)$, which holds for a real-valued impulse response. Recalling
the definition of the two-sided Laplace transform [9-3], for any $f(\tau)$

$$F(s) \triangleq \int_{-\infty}^{+\infty} f(\tau)e^{-s\tau}\,d\tau,$$

we note that such a function of the complex variable $s$ may be obtained from the Fourier
transform $F(\omega)$, a function of the real variable $\omega$, by a two-step procedure. First set

$$F(s)|_{s=j\omega} = F(\omega)$$


and then replace jw by s. An analogous extension method was used earlier for the discrete- 
time case in Chapter 8 where the Fourier transform was extended to the entire complex 
plane by the Z-transform. 


Example 9.5-6 
(output correlation—first-order system) Consider the first-order differential equation 











$$Y'(t) + \alpha Y(t) = X(t), \quad \alpha > 0,$$

with stationary input $X(t)$ with mean $\mu_X = 0$ and impulse covariance function $K_{XX}(\tau) = \delta(\tau)$.
The system function is easily seen to be

$$H(\omega) = \frac{1}{\alpha + j\omega}$$

and the psd of the input process is

$$S_{XX}(\omega) = 1,$$

so we have the following cross- and output-power spectral densities:

$$S_{YX}(\omega) = H(\omega)S_{XX}(\omega) = \frac{1}{\alpha + j\omega},$$

$$S_{YY}(\omega) = |H(\omega)|^2 S_{XX}(\omega) = \frac{1}{|\alpha + j\omega|^2} = \frac{1}{\alpha^2 + \omega^2}.$$

We now convert to Laplace transforms, with $s = j\omega$,

$$S_{YY}(j\omega) = \frac{1}{\alpha^2 - (j\omega)^2} = \frac{1}{(\alpha + j\omega)(\alpha - j\omega)}$$

so that

$$S_{YY}(s) = \frac{1}{(\alpha + s)(\alpha - s)}.$$

Using the residue method (cf. Appendix A) or partial fraction expansion, one can then
directly obtain the following output correlation function by inverse Laplace transform:

$$R_{YY}(\tau) = \frac{1}{2\alpha}\exp(-\alpha|\tau|), \quad -\infty < \tau < +\infty,$$

which is also the output covariance function since $\mu_Y = 0$. By the above equation for
$S_{YX}(\omega)$ we also obtain the cross-correlation function $R_{YX}(\tau) = \exp(-\alpha\tau)u(\tau)$.
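The closed form for $R_{YY}(\tau)$ can also be checked in the time domain. With a white input, $R_{XX}(\tau) = \delta(\tau)$, the autocorrelation row of Table 9.5-2 reduces to $R_{YY}(\tau) = h(\tau) * h(-\tau) = \int h(t+\tau)h(t)\,dt$ for the real impulse response $h(t) = e^{-\alpha t}u(t)$. The sketch below evaluates that integral by the midpoint rule; the value of $\alpha$, the truncation point, and the step count are assumptions made for illustration.

```python
import math

def h(t, alpha):
    """Impulse response of Y' + alpha*Y = X: exp(-alpha*t) for t >= 0, else 0."""
    return math.exp(-alpha * t) if t >= 0.0 else 0.0

def output_correlation(tau, alpha, t_max=40.0, steps=100_000):
    """Midpoint-rule estimate of the integral of h(t + tau) * h(t) over 0 <= t <= t_max."""
    dt = t_max / steps
    return sum(h((k + 0.5) * dt + tau, alpha) * h((k + 0.5) * dt, alpha)
               for k in range(steps)) * dt

alpha = 1.5
for tau in (-1.0, 0.0, 0.5, 2.0):
    numeric = output_correlation(tau, alpha)
    closed_form = math.exp(-alpha * abs(tau)) / (2 * alpha)
    print(tau, numeric, closed_form)
```

The two columns agree for positive and negative lags, which reflects the even symmetry of the correlation function of a real process.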


In Example 9.5-6 it is interesting that $R_{YX}(\tau)$ is 0 for $\tau < 0$. This means that the
output Y is orthogonal to all future values of the input X, a white noise in this case. This 
occurs because of two reasons: The system is causal and the input is a white noise process. 
The system causality requires that the output not depend directly on (i.e., not be a function 
of) future inputs but only depend directly on present and past inputs. The whiteness of the 
input X guarantees that the past and present inputs will be uncorrelated with future inputs. 
Combining both conditions we see that there will be no cross-correlation between the present 
output and the future inputs. If we assume additionally that the input is Gaussian, then the 
input process is an independent process and the output becomes independent of all future 
inputs. Then we can say that the causality of the system prevents the direct dependence 
of the present output on future inputs, and the independent process input prevents any 
indirect dependence. These concepts are important to the theory of Markov processes as 
used in estimation theory (cf. Chapter 11). 


Example 9.5-7 
(output correlation function—second-order system) Consider the following second-order 
LCCDE: 





$$\frac{d^2Y(t)}{dt^2} + 3\frac{dY(t)}{dt} + 2Y(t) = 5X(t),$$

again with white noise input as in the previous example. Here the system function is

$$H(\omega) = \frac{5}{(j\omega)^2 + 3j\omega + 2} = \frac{5}{(2 - \omega^2) + j3\omega}.$$

Thus analogously to Example 9.5-6 the output psd becomes

$$S_{YY}(\omega) = \frac{25}{(2 - \omega^2)^2 + (3\omega)^2} = \frac{25}{\omega^4 + 5\omega^2 + 4}.$$

Applying the residue method to evaluate the IFT, we define the function of a complex
variable $S_{YY}(s)|_{s=j\omega} \triangleq S_{YY}(\omega)$ and rewrite the right-hand side in terms of the complex
variable $j\omega$ to obtain

$$S_{YY}(j\omega) = \frac{25}{(j\omega)^4 - 5(j\omega)^2 + 4}.$$

Substituting $s = j\omega$, we get

$$S_{YY}(s) = \frac{25}{s^4 - 5s^2 + 4},$$

which factors as

$$\frac{5}{(s+1)(s+2)} \cdot \frac{5}{(-s+1)(-s+2)} = H(s)H(-s),$$

where $H(s)$ is the Laplace transform system function. Then the inverse Laplace transform
yields the output correlation function

$$R_{YY}(\tau) = 25\left[\frac{1}{6}\exp(-|\tau|) - \frac{1}{12}\exp(-2|\tau|)\right], \quad -\infty < \tau < +\infty.$$


We leave the details of the calculation to the interested reader. 
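One way to fill in these details, offered as a sketch rather than the authors' intended route, is a partial-fraction expansion in $s^2$ combined with the transform pair already obtained in Example 9.5-6. The grouping of factors below is the only step not stated in the text.

```latex
\begin{align*}
S_{YY}(s) &= \frac{25}{(1-s^2)(4-s^2)}
           = \frac{25}{3}\left[\frac{1}{1-s^2} - \frac{1}{4-s^2}\right], \\
\frac{1}{\alpha^2 - s^2} &\longleftrightarrow \frac{1}{2\alpha}\exp(-\alpha|\tau|)
           \quad \text{(Example 9.5-6, valid for } |\mathrm{Re}\, s| < \alpha\text{)}, \\
R_{YY}(\tau) &= \frac{25}{3}\left[\frac{1}{2}\exp(-|\tau|) - \frac{1}{4}\exp(-2|\tau|)\right]
           = 25\left[\frac{1}{6}\exp(-|\tau|) - \frac{1}{12}\exp(-2|\tau|)\right].
\end{align*}
```

As a consistency check, setting $\tau = 0$ gives the average output power $R_{YY}(0) = 25/12$.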


9.6 PERIODIC AND CYCLOSTATIONARY PROCESSES 


Besides stationarity and its wide-sense version, two other classes of random processes are 
often encountered. They are periodic and cyclostationary processes and are here defined. 


Definition 9.6-1 A random process $X(t)$ is wide-sense periodic if there is a $T > 0$
such that

$$\mu_X(t) = \mu_X(t + T) \quad \text{for all } t$$

and

$$K_{XX}(t_1, t_2) = K_{XX}(t_1 + T, t_2) = K_{XX}(t_1, t_2 + T) \quad \text{for all } t_1, t_2.$$

The smallest such $T$ is called the period. Note that $K_{XX}(t_1, t_2)$ is then periodic with period
$T$ along both axes. ■


An example of a wide-sense periodic random process is the random complex exponential 
of Example 9.4-1. In fact, the random Fourier series representation of the process 


$$X(t) = \sum_{k=1}^{\infty} A_k \exp\left(\frac{j2\pi kt}{T}\right) \qquad (9.6\text{-}1)$$

with random variable coefficients $A_k$ would also be wide-sense periodic. A wide-sense periodic
process can also be WSS, in which case we call it wide-sense periodic stationary. We will
consider these processes further in Chapter 10, where we also refer to them as mean-square 
periodic. The covariance function of a wide-sense periodic process is generically sketched in 
Figure 9.6-1. We see that Kx x (t1,t2) is doubly periodic with a two-dimensional period of 
(T,T). In Chapter 10 we will see that the sample functions of a wide-sense periodic random 
process are periodic with probability 1, that is, 


$$X(t) = X(t + T) \quad \text{for all } t,$$


except for a set of outcomes, i.e. an event, of probability zero. 

Another important classification is cyclostationarity. It is only partially related to peri- 
odicity and is often confused with it. The reader should carefully note the difference in the 
following definition. Roughly speaking, cyclostationary processes have statistics that are 
periodic, while periodic processes have sample functions that are periodic. 


Definition 9.6-2 A random process $X(t)$ is wide-sense cyclostationary if there exists
a positive value $T$ such that

$$\mu_X(t) = \mu_X(t + T) \quad \text{for all } t$$

and

$$K_{XX}(t_1, t_2) = K_{XX}(t_1 + T, t_2 + T) \quad \text{for all } t_1 \text{ and } t_2. \quad ■$$

Figure 9.6-1 Possible contours of the covariance function of a wide-sense (WS) periodic random
process.


An example of cyclostationarity is the random PSK process of Equation 9.2-11. Its 
mean function is zero and hence trivially periodic. Its covariance function (Equation 9.2-13) 
is invariant to a shift by $T$ in both its arguments. Note that Equation 9.2-13 is not doubly
periodic since $R_{XX}(0, T) = 0 \ne R_{XX}(0, 0)$. Also note that the sample functions of $X(t)$ are
not periodic in any sense.

The constant-value contours of the covariance function of a typical cyclostationary 
random process are shown in Figure 9.6-2. Note the difference between this configuration 
and that of a periodic random process, as shown in Figure 9.6-1. Effectively, cyclostationarity 
means that the statistics are periodic, but the process itself is not periodic. 

By averaging along 45° lines (i.e., $t_1 = t_2$), we can get the WSS versions of both types
of processes. The contours of constant density of the periodic process then become the
straight lines of the WSS periodic process shown in Figure 9.6-3. The WSS version of a
cyclostationary process just becomes an ordinary WSS process, because of the lack of any
periodic structure along 135° (anti-diagonal) lines (i.e., $t_1 = -t_2$).

In addition to modulators, scanning sensors tend to produce cyclostationary processes. 
For example, the line-by-line scanning in television transforms the random image field into 
a one-dimensional random process that has been modeled as cyclostationary. In communica- 
tions, cyclostationarity often arises due to waveform repetition at the baud or 
symbol rate.

Figure 9.6-2 Possible contour plot of covariance function of WS cyclostationary random process.




















Figure 9.6-3 Possible contour plot of covariance function of WSS periodic random process. (Solid 
lines are maxima; dashed lines are minima.) 


A place where cyclostationarity arises in signal processing is when a stationary random 
sequence is analyzed by a filter bank and subsampled. The subsequent filter bank synthesis 
involves upsampling and reconstruction filters. If the subsampling period is N, then the 
resulting synthesized random sequence will be cyclostationary with period N. When perfect 
reconstruction filters are used, then true stationarity will be achieved for the synthesized 
output.

While cyclostationary processes are not stationary or WSS except in trivial cases, it is
sometimes appropriate to convert a cyclostationary process into a stationary process as in 
the following example. 





Example 9.6-1 
(WSS PSK) We have seen that the PSK process of Section 9.2 is cyclostationary and 
hence not WSS. This is easily seen with reference to Equation 9.2-13. This cyclostationarity 
arises from the fact that the analog angle process $\Theta_a(t)$ is stepwise constant and changes
only at t = nT for integer n. In many real situations the modulation process starts at an 
arbitrary time t, which in fact can be modeled as random from the viewpoint of the system 
designer. Thus in this practical case, the modulated signal process (Equation 9.2-11) is 
converted to 


$$\tilde{X}(t) = \cos(2\pi f_c t + \Theta_a(t) + 2\pi f_c T_0), \qquad (9.6\text{-}2)$$

by the addition of a random variable $T_0$, which is uniformly distributed on $[0, T]$ and
independent of the angle process $\Theta_a(t)$. It is then easy to see that the mean and covariance
functions need only to be modified by an ensemble average over $T_0$, which by the uniformity
of $T_0$ is just an integral over $[0, T]$. We thus obtain

$$\begin{aligned} R_{\tilde{X}\tilde{X}}(t_1 + \tau, t_1) &= \frac{1}{T}\int_0^T R_{XX}(t_1 + \tau + t, t_1 + t)\,dt \\ &= \frac{1}{T}\int_{-\infty}^{+\infty} s_Q(t_1 + t + \tau)s_Q(t_1 + t)\,dt \\ &= \frac{1}{T}s_Q(\tau) * s_Q(-\tau), \qquad (9.6\text{-}3) \end{aligned}$$

which is just a function of the shift $\tau$. Thus $\tilde{X}(t)$ is a WSS random process.
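The effect of the random start time can be seen on a simpler stand-in process. The sketch below does not use the PSK signal of Equation 9.2-11; it assumes instead a binary pulse process $X(t) = \sum_n A_n p(t - nT)$ with independent, zero-mean, equally likely $\pm 1$ amplitudes $A_n$ and a rectangular pulse $p$ of width $T$, for which $R_{XX}(t_1, t_2)$ equals 1 when $t_1$ and $t_2$ fall in the same symbol interval and 0 otherwise. All names and parameter values are illustrative.

```python
import math

T = 1.0  # symbol (baud) interval of the illustrative process

def corr(t1, t2):
    """R_XX(t1, t2) = 1 when t1 and t2 lie in the same symbol interval, else 0."""
    return 1.0 if math.floor(t1 / T) == math.floor(t2 / T) else 0.0

def averaged_corr(tau, t1, steps=2000):
    """(1/T) * integral over 0 <= t <= T of R_XX(t1 + tau + t, t1 + t), midpoint rule."""
    dt = T / steps
    return sum(corr(t1 + tau + (k + 0.5) * dt, t1 + (k + 0.5) * dt)
               for k in range(steps)) * dt / T

# Cyclostationary: invariant to a joint shift by T, but not to a shift of one argument only.
print(corr(0.2, 0.7), corr(0.2 + T, 0.7 + T), corr(0.2 + T, 0.7))

# After averaging over the random start time, only the lag tau matters.
for t1 in (0.0, 0.3, 0.9):
    print(t1, [round(averaged_corr(tau, t1), 3) for tau in (0.0, 0.25, 0.5, 1.5)])
```

For this stand-in process the averaged correlation is the triangle $\max(0, 1 - |\tau|/T)$ regardless of $t_1$, which is the same mechanism that makes $\tilde{X}(t)$ WSS in the example.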


Example 9.6-2
(power spectral density of PSK) A WSS version of the random PSK signal was defined in 
Example 9.6-1 through an averaging process, where the average was taken over the message 
time or baud interval $T$. The resulting WSS random process $\tilde{X}(t)$ had correlation function
(Equation 9.6-3) given as

$$R_{\tilde{X}\tilde{X}}(\tau) = \frac{1}{T}s_Q(\tau) * s_Q(-\tau),$$

where $s_Q(\tau)$ was given as

$$s_Q(\tau) \triangleq \begin{cases} \sin(2\pi f_c \tau), & 0 \le \tau \le T, \\ 0, & \text{else.} \end{cases}$$

Figure 9.6-4 Power spectral density of PSK plotted for $f_c = 2.5$ and $T = 0.5$.


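Then the psd of this WSS version of PSK can be calculated as

$$\begin{aligned} S_{\tilde{X}\tilde{X}}(\omega) &= FT\{R_{\tilde{X}\tilde{X}}(\tau)\} \\ &= \frac{1}{T}\left|FT\{s_Q(\tau)\}\right|^2 \\ &\approx (T/4)\left[\left(\frac{\sin\big((\omega - 2\pi f_c)T/2\big)}{(\omega - 2\pi f_c)T/2}\right)^2 + \left(\frac{\sin\big((\omega + 2\pi f_c)T/2\big)}{(\omega + 2\pi f_c)T/2}\right)^2\right], \quad \text{for } f_c T \gg 1, \qquad (9.6\text{-}4) \end{aligned}$$

which can be plotted† using MATLAB. The file psd_PSK.m is included on this book's Web site.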

Some plots were made using psd_PSK.m, for two different sets of values for $f_c$ and $T$.
First we look at the psd plot in Figure 9.6-4 for $f_c = 2.5$ and $T = 0.5$, which gives considerable
overlap of the positive and negative frequency lobes of $S_{\tilde{X}\tilde{X}}(\omega)$. The lack of power
concentration at the carrier frequency $f_c$ is not surprising, since there is only a little over
one period of $s_Q(t)$ in the baud interval $T$. The next pair of plots shows a quite different case
with power strongly concentrated at $\omega_c$. These plots were computed with the values $f_c = 3.0$
and $T = 5.0$, thus giving 15 periods of the sine wave in the baud interval $T$. Figure 9.6-5 is
a linear plot, while Figure 9.6-6 shows $S_{\tilde{X}\tilde{X}}(\omega)$ on a logarithmic scale.
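For readers without MATLAB, the approximation in Equation 9.6-4 can be compared against a direct numerical Fourier transform of $s_Q$. The sketch below is not the book's psd_PSK.m, whose contents are not reproduced here; the function names, the midpoint-rule integration, and the sample frequencies are assumptions made for illustration.

```python
import cmath
import math

def sinc_sq(x):
    """(sin x / x)^2 with the removable singularity at x = 0 filled in."""
    return 1.0 if abs(x) < 1e-12 else (math.sin(x) / x) ** 2

def psd_approx(omega, fc, T):
    """Right-hand side of Equation 9.6-4 (cross-term neglected)."""
    wc = 2 * math.pi * fc
    return (T / 4) * (sinc_sq((omega - wc) * T / 2) + sinc_sq((omega + wc) * T / 2))

def psd_numeric(omega, fc, T, steps=20_000):
    """(1/T) |FT{s_Q}|^2 with the Fourier integral over [0, T] done by the midpoint rule."""
    wc = 2 * math.pi * fc
    dt = T / steps
    ft = sum(math.sin(wc * (k + 0.5) * dt) * cmath.exp(-1j * omega * (k + 0.5) * dt)
             for k in range(steps)) * dt
    return abs(ft) ** 2 / T

fc, T = 3.0, 5.0   # the values used for Figures 9.6-5 and 9.6-6
wc = 2 * math.pi * fc
for omega in (0.0, wc - 1.0, wc, wc + 1.0):
    print(omega, psd_approx(omega, fc, T), psd_numeric(omega, fc, T))
```

At the carrier frequency both expressions give $T/4$; away from it they differ slightly, by the neglected cross-term between the two lobes.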


†The reason for the approximate equals sign is that we have neglected the cross-term in Equation 9.6-4
between the two sinc terms at $\pm f_c$, as is appropriate for $f_c T \gg 1$.

Figure 9.6-5 Power spectral density of PSK plotted for $f_c = 3$ and $T = 5$.

Figure 9.6-6 Log of power spectral density of PSK plotted for $f_c = 3$ and $T = 5$.

9.7 VECTOR PROCESSES AND STATE EQUATIONS


In this section we will generalize some of the results of Section 9.5 to the important class of 
vector random processes. This will lead into a brief discussion of state equations and vector 
Markov processes. Vector random processes occur in two-channel systems that are used in 
communications to model the in-phase and quadrature components of bandpass signals. 
Vector processes are also used extensively in control systems to model industrial processes 
with several inputs and outputs. Also, vector models are created artificially from high-order 
scalar models in order to employ the useful concept of state in both estimation and control 
theory. 

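Let $X_1(t)$ and $X_2(t)$ be two jointly stationary random processes that are input to the
systems $H_1$ and $H_2$, respectively. Call the outputs $Y_1$ and $Y_2$, as shown in Figure 9.7-1.

From earlier discussions we know how to calculate $R_{X_1Y_1}$, $R_{X_2Y_2}$, $R_{Y_1Y_1}$, $R_{Y_2Y_2}$. We
now look at how to calculate the correlations across the systems, that is, $R_{X_1Y_2}$, $R_{X_2Y_1}$,
and $R_{Y_1Y_2}$. Given $R_{X_1X_2}$, we first calculate

$$\begin{aligned} R_{X_1Y_2}(\tau) &= E[X_1(t + \tau)Y_2^*(t)] \\ &= \int_{-\infty}^{+\infty} E[X_1(t + \tau)X_2^*(t - \beta)]h_2^*(\beta)\,d\beta \\ &= \int_{-\infty}^{+\infty} R_{X_1X_2}(\tau + \beta)h_2^*(\beta)\,d\beta \\ &= \int_{-\infty}^{+\infty} R_{X_1X_2}(\tau - \beta')h_2^*(-\beta')\,d\beta', \quad (\beta' \triangleq -\beta), \end{aligned}$$

so

$$R_{X_1Y_2}(\tau) = R_{X_1X_2}(\tau) * h_2^*(-\tau),$$

and by symmetry

$$R_{X_2Y_1}(\tau) = R_{X_2X_1}(\tau) * h_1^*(-\tau).$$

The cross-correlation at the outputs is

$$R_{Y_1Y_2}(\tau) = h_1(\tau) * R_{X_1X_2}(\tau) * h_2^*(-\tau).$$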


Figure 9.7-1 A generic (uncoupled) two-channel LSI system. (Block diagram: $X_1(t)$ drives $H_1(\omega)$ to produce $Y_1(t)$; $X_2(t)$ drives $H_2(\omega)$ to produce $Y_2(t)$.)

Figure 9.7-2 General two-channel LSI system.


Expressing these results in the spectral domain, we have

$$S_{X_1Y_2}(\omega) = S_{X_1X_2}(\omega)H_2^*(\omega)$$

and

$$S_{Y_1Y_2}(\omega) = H_1(\omega)H_2^*(\omega)S_{X_1X_2}(\omega).$$

In passing, we note the following important fact: If the supports† of the two system functions
$H_1$ and $H_2$ do not overlap, then $Y_1$ and $Y_2$ are orthogonal random processes independent of
any correlation in the input processes. We can generalize the above to a two-channel system
with internal coupling as seen in Figure 9.7-2. Here two additional system functions have
been added to cross-couple the inputs and outputs. They are denoted by $H_{12}$ and $H_{21}$.
This case is best treated with vector notation; thus we define


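$$\mathbf{X}(t) \triangleq [X_1(t), X_2(t)]^T, \quad \mathbf{Y}(t) \triangleq [Y_1(t), Y_2(t)]^T,$$

$$\mathbf{h}(t) \triangleq \begin{bmatrix} h_{11}(t) & h_{12}(t) \\ h_{21}(t) & h_{22}(t) \end{bmatrix},$$

where $h_{ij}(t)$ is the impulse response of the subsystem with frequency response $H_{ij}(\omega)$. We
then have

$$\mathbf{Y}(t) = \mathbf{h}(t) * \mathbf{X}(t), \qquad (9.7\text{-}1)$$

where the vector convolution is defined by

$$(\mathbf{h}(t) * \mathbf{X}(t))_i \triangleq \sum_{j=1}^{2} h_{ij}(t) * X_j(t).$$

†We recall that the support of a function $g$ is defined as $\mathrm{supp}(g) \triangleq \{x \mid g(x) \ne 0\}$.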


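If we define the following relevant input and output correlation matrices

$$\mathbf{R}_{XX}(\tau) \triangleq \begin{bmatrix} R_{X_1X_1}(\tau) & R_{X_1X_2}(\tau) \\ R_{X_2X_1}(\tau) & R_{X_2X_2}(\tau) \end{bmatrix} \qquad (9.7\text{-}2)$$

$$\mathbf{R}_{YY}(\tau) \triangleq \begin{bmatrix} R_{Y_1Y_1}(\tau) & R_{Y_1Y_2}(\tau) \\ R_{Y_2Y_1}(\tau) & R_{Y_2Y_2}(\tau) \end{bmatrix} \qquad (9.7\text{-}3)$$

one can show that (Problem 9.44)

$$\mathbf{R}_{YY}(\tau) = \mathbf{h}(\tau) * \mathbf{R}_{XX}(\tau) * \mathbf{h}^{\dagger}(-\tau), \qquad (9.7\text{-}4)$$

where the † indicates the Hermitian (or conjugate) transpose.
Taking the matrix Fourier transformation, we obtain

$$\mathbf{S}_{YY}(\omega) = \mathbf{H}(\omega)\mathbf{S}_{XX}(\omega)\mathbf{H}^{\dagger}(\omega) \qquad (9.7\text{-}5)$$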


with 
H(w) = FT{h(t)}, 


and 
S(w) = FT{R(7)}, 


where this notation is meant to imply an element-by-element Fourier transform. This multi- 
channel generalization clearly extends to the M input and N output case by just enlarging 
the matrix dimensions accordingly. 
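Equation 9.7-5 can be evaluated at a single frequency with ordinary complex arithmetic. The sketch below assumes 2×2 matrices stored as nested lists; the helper names and the numerical values of $H_1$, $H_2$, and the input psd matrix are illustrative assumptions, not values from the text. With the cross-coupling terms set to zero, the off-diagonal entry reduces to the uncoupled result $S_{Y_1Y_2} = H_1 H_2^* S_{X_1X_2}$.

```python
def matmul(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def hermitian(p):
    """Conjugate transpose of a 2x2 complex matrix."""
    return [[p[j][i].conjugate() for j in range(2)] for i in range(2)]

def output_psd_matrix(H, Sxx):
    """S_YY(w) = H(w) S_XX(w) H^dagger(w), evaluated at a single frequency."""
    return matmul(matmul(H, Sxx), hermitian(H))

# Illustrative values at one frequency (assumed, not from the text).
H1, H2 = 0.5 - 0.25j, 2.0 + 1.0j
H_uncoupled = [[H1, 0j], [0j, H2]]                        # H12 = H21 = 0, as in Figure 9.7-1
Sxx = [[2.0 + 0j, 0.3 + 0.4j], [0.3 - 0.4j, 1.0 + 0j]]    # Hermitian input psd matrix
Syy = output_psd_matrix(H_uncoupled, Sxx)
print(Syy[0][1], H1 * H2.conjugate() * Sxx[0][1])
```

Replacing the zeros in `H_uncoupled` with nonzero cross-coupling responses gives the general two-channel case of Figure 9.7-2 with no change to the code.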


State Equations 


As shown in Problem 9.43, it is possible to rewrite an Nth-order LCCDE in the form of
a first-order vector differential equation where the dimension of the output vector is equal
to N,

$$\dot{\mathbf{Y}}(t) = \mathbf{A}\mathbf{Y}(t) + \mathbf{B}\mathbf{X}(t), \quad -\infty < t < +\infty. \qquad (9.7\text{-}6)$$

This is just a multichannel system as seen in Equation 9.7-1 and can be interpreted as a set
of N coupled first-order LCCDEs. We can take the vector Fourier transform and calculate
the system function

$$\mathbf{H}(\omega) = (j\omega\mathbf{I} - \mathbf{A})^{-1}\mathbf{B} \qquad (9.7\text{-}7)$$

to specify this LSI operation in the frequency domain. Here $\mathbf{I}$ is the identity matrix.
Alternatively, we can express the operation in terms of a matrix convolution

$$\mathbf{Y}(t) = \mathbf{h}(t) * \mathbf{X}(t),$$

where we assume the multichannel system is stable; that is, all the impulse responses $h_{ij}$ are
BIBO stable. The solution proceeds much the same as in the scalar case for the first-order
equation; in fact, it can be shown that

$$\mathbf{h}(t) = \exp(\mathbf{A}t)\mathbf{B}u(t). \qquad (9.7\text{-}8)$$


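The matrix exponential function $\exp(\mathbf{A}t)$ was encountered earlier in this chapter in the
solution of the probability vector for a continuous-time Markov chain. This function is
widely used in linear system theory, where its properties have been studied extensively [9-3].
If we compute the cross-correlation matrices in the WSS case, we obtain

$$\mathbf{R}_{YX}(\tau) = \exp(\mathbf{A}\tau)\mathbf{B}u(\tau) * \mathbf{R}_{XX}(\tau)$$

and

$$\mathbf{R}_{XY}(\tau) = \mathbf{R}_{XX}(\tau) * \mathbf{B}^{\dagger}\exp(-\mathbf{A}^{\dagger}\tau)u(-\tau),$$

with output correlation matrix, as before,

$$\mathbf{R}_{YY}(\tau) = \mathbf{h}(\tau) * \mathbf{R}_{XX}(\tau) * \mathbf{h}^{\dagger}(-\tau). \qquad (9.7\text{-}9)$$

Upon vector Fourier transformation, this becomes

$$\mathbf{S}_{YY}(\omega) = (j\omega\mathbf{I} - \mathbf{A})^{-1}\mathbf{B}\mathbf{S}_{XX}(\omega)\mathbf{B}^{\dagger}(-j\omega\mathbf{I} - \mathbf{A}^{\dagger})^{-1}. \qquad (9.7\text{-}10)$$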
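To connect Equation 9.7-7 with the scalar results of Section 9.5, the sketch below evaluates $(j\omega\mathbf{I} - \mathbf{A})^{-1}\mathbf{B}$ for a two-state system. The companion-form matrices are an assumption made here: they represent the second-order equation of Example 9.5-7 with state vector $[Y, Y']$, which is one common choice and not necessarily the construction used in Problem 9.43.

```python
def state_space_response(A, B, omega):
    """H(w) = (jw I - A)^(-1) B for a 2x2 matrix A and a length-2 column B."""
    jw = 1j * omega
    m00, m01 = jw - A[0][0], -A[0][1]
    m10, m11 = -A[1][0], jw - A[1][1]
    det = m00 * m11 - m01 * m10
    # Inverse of a 2x2 matrix via its adjugate, applied directly to B.
    return [(m11 * B[0] - m01 * B[1]) / det,
            (-m10 * B[0] + m00 * B[1]) / det]

# Assumed companion form of Y'' + 3Y' + 2Y = 5X with state vector [Y, Y'].
A = [[0.0, 1.0], [-2.0, -3.0]]
B = [0.0, 5.0]

for omega in (0.0, 0.5, 2.0):
    jw = 1j * omega
    scalar = 5.0 / (jw ** 2 + 3 * jw + 2)
    print(omega, state_space_response(A, B, omega)[0], scalar)
```

The first component reproduces the scalar system function of Example 9.5-7, and the second component is $j\omega$ times the first, as expected for the derivative state.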


If $\mathbf{R}_{XX}(\tau) = \mathbf{Q}\delta(\tau)$, then since the system $\mathbf{H}$ is assumed causal, that is, $\mathbf{h}(t) = \mathbf{0}$ for $t < 0$,
we have that the cross-correlation matrix $\mathbf{R}_{YX}(\tau) = \mathbf{0}$ for $\tau < 0$; that is, $E[\mathbf{Y}(t + \tau)\mathbf{X}^{\dagger}(t)] = \mathbf{0}$
for $\tau < 0$. In words we say that $\mathbf{Y}(t + \tau)$ is orthogonal to $\mathbf{X}(t)$ for $\tau < 0$. Thus, the
past of $\mathbf{Y}(t)$ is orthogonal to the present and future of $\mathbf{X}(t)$. If we additionally assume
that the input process $\mathbf{X}(t)$ is a Gaussian process, then the uncorrelatedness condition
becomes an independence condition. Under the Gaussian assumption then, the output $\mathbf{Y}(t)$
is independent of the present and future of $\mathbf{X}(t)$. A similar result was noted earlier in the
scalar-valued case. We can use this result to show that the solution to a first-order vector
LCCDE is a vector Markov random process with the following definition.


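Definition 9.7-1 (vector Markov) A random process $\mathbf{Y}(t)$ is vector Markov if for all
$n > 0$ and for all $t_n > t_{n-1} > \cdots > t_1$, and for all values $\mathbf{y}(t_{n-1}), \ldots, \mathbf{y}(t_1)$, we have

$$P[\mathbf{Y}(t_n) \le \mathbf{y}_n \mid \mathbf{y}(t_{n-1}), \ldots, \mathbf{y}(t_1)] = P[\mathbf{Y}(t_n) \le \mathbf{y}_n \mid \mathbf{y}(t_{n-1})]$$

for all values of the real vector $\mathbf{y}_n$. Here $\mathbf{A} \le \mathbf{a}$ means
$(A_n \le a_n, A_{n-1} \le a_{n-1}, \ldots, A_1 \le a_1)$. ■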


Before discussing vector differential equations we briefly recall a result for deterministic 
vector LCCDEs. The first-order vector equation, 


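$$\dot{\mathbf{y}}(t) = \mathbf{A}\mathbf{y}(t) + \mathbf{B}\mathbf{x}(t), \quad t \ge t_0,$$

subject to the initial condition $\mathbf{y}(t_0)$, can be shown to have solution, employing the matrix
exponential

$$\mathbf{y}(t) = \exp[\mathbf{A}(t - t_0)]\mathbf{y}(t_0) + \int_{t_0}^{t} \mathbf{h}(t - v)\mathbf{x}(v)\,dv, \quad t \ge t_0,$$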


thus generalizing the scalar case. This deterministic solution can be found in any graduate 
text on linear systems theory, for example, in [9-3]. The first term is called the zero-input 
solution and the second term is called the zero-state (or driven) solution analogously to the 
solution for scalar LCCDEs. 

We can extend this theory to the stochastic case by considering the differential
Equation 9.7-6 over the semi-infinite domain $t_0 \le t < \infty$ and replacing the above deterministic
solution with the following stochastic solution, expressed with the help of an integral:

$$\mathbf{Y}(t) = \exp[\mathbf{A}(t - t_0)]\mathbf{Y}(t_0) + \int_{t_0}^{t} \mathbf{h}(t - v)\mathbf{X}(v)\,dv. \qquad (9.7\text{-}11)$$

If the LCCDE is BIBO stable, that is, the real parts of the eigenvalues of $\mathbf{A}$ are all
negative, in the limit as $t_0 \to -\infty$, we get the solution for all time, that is $t_0 = -\infty$,

$$\mathbf{Y}(t) = \int_{-\infty}^{t} \mathbf{h}(t - v)\mathbf{X}(v)\,dv = \mathbf{h}(t) * \mathbf{X}(t), \qquad (9.7\text{-}12)$$


which is the same as already derived for the stationary infinite time-interval case. In effect, 
we use the stability of the system to conclude that the resulting zero-input part of the 
solution must be zero at any finite time. 

The following theorem shows a method to generate a vector Gauss-Markov random 
process using the above approach. The input is now a white Gaussian vector process W(t) 
and the output vector Markov process is denoted by X(t). 


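Theorem 9.7-1 Let the input to the state equation

$$\dot{\mathbf{X}}(t) = \mathbf{A}\mathbf{X}(t) + \mathbf{B}\mathbf{W}(t)$$

be the white Gaussian process $\mathbf{W}(t)$. Then the output $\mathbf{X}(t)$ is a vector Gauss-Markov
random process.


Proof We write the solution at $t_n$ in terms of the solution at an earlier time $t_{n-1}$ as

$$\mathbf{X}(t_n) = \exp[\mathbf{A}(t_n - t_{n-1})]\mathbf{X}(t_{n-1}) + \int_{t_{n-1}}^{t_n} \mathbf{h}(t_n - v)\mathbf{W}(v)\,dv.$$

Then we write the integral term as $\mathbf{I}(t_n)$ and note that it is independent of $\mathbf{X}(t_{n-1})$. Thus
we can deduce that

$$\begin{aligned} P[\mathbf{X}(t_n) \le \mathbf{x}_n \mid \mathbf{x}(t_{n-1}), \ldots, \mathbf{x}(t_1)] &= P[\mathbf{I}(t_n) \le \mathbf{x}_n - e^{\mathbf{A}(t_n - t_{n-1})}\mathbf{x}(t_{n-1}) \mid \mathbf{x}(t_{n-1}), \ldots, \mathbf{x}(t_1)] \\ &= P[\mathbf{I}(t_n) \le \mathbf{x}_n - e^{\mathbf{A}(t_n - t_{n-1})}\mathbf{x}(t_{n-1}) \mid \mathbf{x}(t_{n-1})] \end{aligned}$$

and hence that $\mathbf{X}(t)$ is a vector Markov process. ■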


If in Theorem 9.7-1 we did not have the Gaussian condition on the input W (t) but 
just the white noise condition, then we could not conclude that the output was Markov. 
This is because we would not have the independence condition required in the proof but 
only the weaker uncorrelatedness condition. On the other hand, if we relax the Gaussian 
condition but require that the input W(t) be an independent random process, then the 
process X(t) would still be Markov, but not Gauss-Markov. We use X for the process in 
this theorem rather than Y to highlight the fact that LCCDEs are often used to model 
input processes too. 
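The one-step form used in the proof also gives an exact recursion for the second moment of the sampled process. The sketch below specializes to a scalar case, which is an assumption made for illustration: $A = -\alpha$, $B = 1$, and unit-intensity white noise, so that the integral term $I(t_n)$ has variance $\int_0^{\Delta} e^{-2\alpha u}\,du$. Because $I(t_n)$ is independent of $X(t_{n-1})$, the variances simply add at each step.

```python
import math

def stationary_variance(alpha, delta, steps=10_000):
    """Iterate P_n = phi^2 * P_(n-1) + q for the sampled scalar equation X' = -alpha*X + W."""
    phi = math.exp(-alpha * delta)                            # exp[A (t_n - t_(n-1))] with A = -alpha
    q = (1.0 - math.exp(-2 * alpha * delta)) / (2 * alpha)    # variance of the integral term I(t_n)
    p = 0.0                                                   # start from a known state X(t_0) = 0
    for _ in range(steps):
        p = phi * phi * p + q
    return p

alpha = 1.5
print(stationary_variance(alpha, delta=0.01), 1 / (2 * alpha))
```

The recursion settles at $1/(2\alpha)$, the value of $R_{YY}(0)$ found for the same first-order system in Example 9.5-6, independent of the sampling step.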


SUMMARY 


In this chapter we introduced the concept of the random process, an ensemble of functions 
of a continuous parameter. The parameter is most often time but can be position or another 
continuous variable. Most topics in this chapter generalize to two- and three-dimensional 
parameters. Many modern applications, in fact, require a two-dimensional parameter, for 
example, the intensity function $i(t_1, t_2)$ of an image. Such random functions are called
random fields and can be analyzed using extensions of the methods of this chapter. Random 
fields are discussed in Chapter 7 of [9-5] and in [9-8] among many other places. 

We introduced a number of important processes: asynchronous binary signaling; the 
Poisson counting process; the random telegraph signal; phase-shift keying, which is basic 
to digital communications; the Wiener process, our first example of a Gaussian random 
process and a basic building block process in nonlinear filter theory; and the Markov process, 
which is widely used for its efficiency and tractability and is the signal model in the widely 
employed Kalman-Bucy filter of Chapter 11.

We considered the effect of linear systems on the second-order properties of random 
processes. We specialized our results to the useful subcategory of stationary and WSS 
processes and introduced the power spectral density and the corresponding analysis for LSI 
systems. We also briefly considered the classes of wide-sense periodic and cyclostationary 
processes and introduced random vector processes and systems and extended the Markov 
model to them.


PROBLEMS 


(*Starred problems are more advanced and may require more work and/or additional
reading.)

9.1 Let $X[n]$ be a real-valued stationary random sequence with mean $E\{X[n]\} = \mu_X$
and autocorrelation function $E\{X[n + m]X[n]\} = R_{XX}[m]$. If $X[n]$ is the input to
a D/A converter, the continuous-time output can be idealized as the analog random
process $X_a(t)$ with

$$X_a(t) \triangleq X[n] \quad \text{for } n \le t < n + 1, \text{ for all } n,$$

as shown in Figure P9.1.

Figure P9.1 Typical output of sample-hold D/A converter.

(a) Find the mean $E[X_a(t)] = \mu_a(t)$ as a function of $\mu_X$.
(b) Find the correlation $E[X_a(t_1)X_a(t_2)] = R_{X_aX_a}(t_1, t_2)$ in terms of $R_{XX}[m]$.

9.2 Consider a WSS random sequence $X[n]$ with mean function $\mu_X$, a constant, and
correlation function $R_{XX}[m]$. Form a random process as

$$X(t) \triangleq \sum_{n=-\infty}^{+\infty} X[n]\frac{\sin \pi(t - nT)/T}{\pi(t - nT)/T}, \quad -\infty < t < +\infty.$$

In what follows, we assume the infinite sums converge and so do not worry about
stochastic convergence issues.

(a) Find $\mu_X(t)$ in terms of $\mu_X$. Simplify your answer as much as possible.
(b) Find $R_{XX}(t_1, t_2)$ in terms of $R_{XX}[m]$. Is $X(t)$ WSS?

Hint: The sampling theorem from Linear Systems Theory states that any bandlimited
deterministic function $g(t)$ can be recovered exactly from its evenly spaced
samples, that is,

$$g(t) = \sum_{n=-\infty}^{+\infty} g(nT)\frac{\sin \pi(t - nT)/T}{\pi(t - nT)/T},$$

when the radian bandwidth of the function $g(t)$ is $\pi/T$ or less.

9.3 Consider the random process $Y(t) = (-1)^{X(t)}$, where $X(t)$ is a Poisson process with
rate $\lambda$. Thus, $Y(t)$ starts at $Y(0) = 1$ and switches back and forth from $+1$ to $-1$ at
random Poisson times $T_i$.

(a) Find the mean of $Y(t)$.
(b) Find the autocorrelation function of $Y(t)$.
(c) If $Z(t) = AY(t)$, where $A$ is a random variable independent of $Y(t)$ and takes on
the values $\pm 1$ with equal probability, show that $Z(t)$ is WSS and find the power
spectral density of $Z(t)$.

9.4 The output $Y(t)$ of a tapped delay-line filter shown in Figure P9.4, with input $X(t)$
and $N$ taps, is given by

$$Y(t) = \sum_{n=0}^{N-1} A_n X(t - nT).$$

Figure P9.4 Tapped delay-line filter.

The input $X(t)$ is a stationary Gaussian random process with zero mean and autocorrelation
function $R_{XX}(\tau)$ having the property that $R_{XX}(nT) = 0$ for every integer
$n \ne 0$. The tap gains $A_n$, $n = 0, 1, \ldots, N - 1$, are zero-mean, uncorrelated Gaussian
random variables with common variance $\sigma_A^2$. Every tap gain is independent of the
input process $X(t)$.


(a) Find the autocorrelation function of Y(t). 
(b) For a given value of ż, find the characteristic function of Y(t). Justify your 


steps. 
(c) For fixed t, what is the asymptotic pdf of gr (0), asymptotic as N — 00? 
Explain. 


(d) Suppose now that the number of taps N is a Poisson random variable with 
mean A(> 0). Find the answers to parts (a) and (b) now. 


(Note: You may need to use the following: e77 ~ 74, for |x| << 1, and e” = 
Dro aT: 


Let N(t) be a Poisson random process defined on 0 < t < œo with N(0) = 0 and 
mean arrival rate À > 0. 


(a) Find the joint probability P[N (t1) = nı, N (t2) = ng] for te > tı. 
(b) Find an expression for the Kth order joint PMF, 


Py(m,...,2K3t1,...,tk), 


with 0 < tı < tg <... < tg < co. Be careful to consider the relative values 
of $n_1, \ldots, n_K$.

*9.6 The nonuniform Poisson counting process $N(t)$ is defined for $t \ge 0$ as follows:

(a) $N(0) = 0$.
(b) $N(t)$ has independent increments.
(c) For all $t_2 \ge t_1$,

$$P[N(t_2) - N(t_1) = n] = \frac{\left[\int_{t_1}^{t_2} \lambda(v)\,dv\right]^n}{n!} \exp\left(-\int_{t_1}^{t_2} \lambda(v)\,dv\right), \quad \text{for } n \ge 0.$$

The function $\lambda(t)$ is called the intensity function and is everywhere nonnegative,
that is, $\lambda(t) \ge 0$ for all $t$.

(a) Find the mean function $\mu_N(t)$ of the nonuniform Poisson process.
(b) Find the correlation function $R_{NN}(t_1,t_2)$ of $N(t)$.

Define a warping of the time axis as follows:

$$\tau(t) \triangleq \int_0^t \lambda(v)\,dv.$$

Now $\tau(t)$ is monotonic increasing if $\lambda(v) > 0$ for all $v$, so we can then define
the inverse mapping $t(\tau)$ as shown in Figure P9.6.

Figure P9.6 Plot of $\tau$ versus $t$.

(c) Assume $\lambda(t) > 0$ for all $t$ and define the counting process

$$N_u(\tau) \triangleq N(t(\tau)).$$

Show that $N_u(\tau)$ is a uniform Poisson counting process with rate $\lambda = 1$; that
is, show for $\tau \ge 0$:

(1) $N_u(0) = 0$.
(2) $N_u(\tau)$ has independent increments.
(3) For all $\tau_2 \ge \tau_1$,

$$P[N_u(\tau_2) - N_u(\tau_1) = n] = \frac{(\tau_2 - \tau_1)^n}{n!}\, e^{-(\tau_2 - \tau_1)}, \quad n \ge 0.$$
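
The time-axis warping in Problem 9.6 also suggests a way to simulate a nonuniform Poisson process, which can serve as a numerical check on hand-derived results. The following is a minimal Python sketch under stated assumptions: the intensity $\lambda(t) = 1 + t$ and all function names are hypothetical choices for illustration, not taken from the text. Unit-rate arrivals are generated on the $\tau$ axis and mapped back through the inverse warping $t(\tau)$.

```python
import math
import random


def unit_rate_arrivals(tau_max, rng):
    """Arrival times of a rate-1 (uniform) Poisson process on (0, tau_max]."""
    arrivals, tau = [], 0.0
    while True:
        tau += rng.expovariate(1.0)  # exponential interarrival time with mean 1
        if tau > tau_max:
            return arrivals
        arrivals.append(tau)


def warp(t):
    """tau(t): integral of the assumed intensity 1 + v over [0, t]."""
    return t + 0.5 * t * t


def unwarp(tau):
    """Inverse mapping t(tau): nonnegative root of t**2 / 2 + t - tau = 0."""
    return -1.0 + math.sqrt(1.0 + 2.0 * tau)


def nonuniform_arrivals(t_max, rng):
    """Map unit-rate arrivals on the tau axis back to the t axis."""
    return [unwarp(tau) for tau in unit_rate_arrivals(warp(t_max), rng)]


rng = random.Random(0)
counts = [len(nonuniform_arrivals(2.0, rng)) for _ in range(20000)]
mean_count = sum(counts) / len(counts)
print(f"empirical mean of N(2): {mean_count:.3f}")
```

The empirical mean count can be compared with the mean function found in part (a) for the assumed intensity; a different intensity only requires replacing `warp` and `unwarp` with a matching pair.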


9.7 A nonuniform Poisson process $N(t)$ has intensity function (mean arrival rate)

$$\lambda(t) = 1 + 2t,$$

for $t \ge 0$. Initially $N(0) = 0$.

(a) Find the mean function $\mu_N(t)$.
(b) Find the correlation function $R_{NN}(t_1, t_2)$.
(c) Find an expression for the probability that $N(t) > t$, that is, find $P[N(t) > t]$
for any $t > 0$.
(d) Give an approximate answer for (c) in terms of the error function $\mathrm{erf}(x)$.


*9.8 This problem concerns the construction of the Poisson counting process as given in
Section 9.2.

(a) Show the density for the $n$th arrival time $T[n]$ is

$$f_T(t; n) = \frac{(\lambda t)^{n-1}}{(n-1)!}\,\lambda e^{-\lambda t}\, u(t), \quad n > 0.$$

In the derivation of the property that the increments of a Poisson process are
Poisson distributed, that is,

$$P[X(t_a) - X(t_b) = n] = \frac{[\lambda(t_a - t_b)]^n}{n!}\, e^{-\lambda(t_a - t_b)}\, u[n], \quad t_a > t_b,$$

we implicitly use the fact that the first interarrival time in $(t_b, t_a]$ is exponentially
distributed. Actually, this fact is not clear, as the interarrival time in
question is only partially in the interval $(t_b, t_a]$. A pictorial diagram is shown
in Figure P9.8. Define $\tau'[i] \triangleq T[i] - t_b$ as the partial interarrival time. We
note $\tau'[i] = \tau[i] - T$, where the random variable $T \triangleq t_b - T[i-1]$ and $\tau[i]$
denotes the (full) interarrival time.

Figure P9.8 Illustrative example of relation of arrival times to arbitrary observation interval.

(b) Fix the random variable $T = t$ and find the CDF

$$F_{\tau'[i]}(\tau' \mid T = t) = P\{\tau[i] \le \tau' + t \mid \tau[i] \ge t\}.$$

(c) Modify the result of part (b) to account for the fact that $T$ is a random
variable, and find the unconditional CDF of $\tau'$. (Hint: This part does not
involve a lot of calculations.)


Because of the preceding properties, the exponential distribution is called memoryless.
It is the only continuous distribution with this property.

9.9 Let $N(t)$ be a counting process on $[0, \infty)$ whose average rate $\lambda(t)$ depends on another
positive random process $S(t)$, specifically $\lambda(t) = S(t)$. We assume that $N(t)$ given
$\{S(t) \text{ on } [0, \infty)\}$ is a nonuniform Poisson process. We know $\mu_S(t) = \mu_0 > 0$ and also
know $K_{SS}(t_1, t_2)$.

(a) Find $\mu_N(t)$ for $t \ge 0$ in terms of $\mu_0$.
(b) Find $\sigma_N^2(t)$ for $t \ge 0$ in terms of $K_{SS}(t_1, t_2)$.


9.10 Let the random process $K(t)$ (not a covariance!) depend on a uniform Poisson process
$N(t)$, with mean arrival rate $\lambda > 0$, as follows: Starting at $t = 0$, both $N(t) = 0$ and
$K(t) = 0$. When an arrival occurs in $N(t)$, an independent Bernoulli trial takes place
with probability of success $p$, where $0 < p < 1$. On success, $K(t)$ is incremented by 1;
otherwise $K(t)$ is left unchanged. This arrangement is shown in Figure P9.11. Find
the first-order PMF of the discrete-valued random process $K(t)$ at time $t$, that is,
$P_K(k; t)$, for $t \ge 0$.

[Block diagram: Poisson process generator, followed by Bernoulli trial generator, with output $K(t)$]

Figure P9.11 Poisson-modulated Bernoulli trial process.
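
The arrangement described above (a Poisson arrival stream followed by independent Bernoulli trials) is straightforward to simulate, which gives an empirical PMF to compare against a hand-derived one. Below is a minimal Python sketch; the function name and the numerical values of the rate, success probability, and observation time are hypothetical, chosen only for illustration.

```python
import random


def thinned_count(rate, p, t, rng):
    """Count successes among the Poisson arrivals in (0, t].

    Arrivals have i.i.d. exponential interarrival times with the given rate;
    each arrival triggers an independent Bernoulli trial with success
    probability p, and only successes increment the count.
    """
    k, clock = 0, rng.expovariate(rate)
    while clock <= t:
        if rng.random() < p:  # Bernoulli trial for this arrival
            k += 1
        clock += rng.expovariate(rate)
    return k


rng = random.Random(1)
rate, p, t = 2.0, 0.3, 5.0  # hypothetical values, not taken from the problem
samples = [thinned_count(rate, p, t, rng) for _ in range(20000)]
pmf_estimate = {k: samples.count(k) / len(samples) for k in sorted(set(samples))}
print(pmf_estimate)
```

The dictionary `pmf_estimate` holds relative frequencies of each observed value of $K(t)$ and can be plotted next to the closed-form answer.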


9.11 Let the scan-line of an image be described by the spatial random process $S(x)$, which
models the ideal gray level at the point $x$. Let us transmit each point independently
with an optical channel by modulating the intensity of a photon source:

$$\lambda(t, x) = S(x) + \lambda_0, \quad 0 \le t \le T.$$

In this way we create a family of random processes, indexed by the continuous
parameter $x$,

$$\{N(t, x)\}.$$

For each $x$, $N(t, x)$ given $S(x)$ is a uniform Poisson process. At the end of the
observation interval, we store $N(x) \triangleq N(T, x)$ and inquire about the statistics of
this spatial process.

To summarize, $N(x)$ is an integer-valued spatial random process that depends on
the value of another random process $S(x)$, called the signal process. The spatial
random process $S(x)$ is stationary with zero mean and covariance function

$$K_{SS}(x) = \sigma_S^2 \exp(-\alpha|x|),$$

where $\alpha > 0$. The conditional distribution of $N(x)$, given $S(x) = s(x)$, is Poisson
with mean $\lambda(x) = (s(x) + \lambda_0)T$, where $\lambda_0$ is a positive constant; that is,

$$P[N(x) = n \mid S(x) = s(x)] = \frac{\lambda^n(x)}{n!}\, e^{-\lambda(x)}\, u[n].$$

The random variables $N(x)$ are conditionally independent from point to point.

(a) Find the (unconditional) mean and variance
$\mu_N(x) = E[N(x)]$ and $E[(N(x) - \mu_N(x))^2]$.
(Hint: First find the conditional mean and conditional mean square.)
(b) Find $R_{NN}(x_1, x_2) \triangleq E[N(x_1)N(x_2)]$.

9.12 Let $X(t)$ be a random telegraph signal (RTS) defined on $t \ge 0$. Fix $X(0) = +1$. The
RTS uses a Poisson random arrival time sequence $T[n]$ to switch the value of $X(t)$
between $\pm 1$. Take the average arrival rate as $\lambda\,(> 0)$. Thus we have

$$X(t) \triangleq \begin{cases} +1, & 0 \le t < T[1] \\ -1, & T[1] \le t < T[2] \\ +1, & T[2] \le t < T[3] \\ \;\vdots \end{cases}$$

(a) Argue that $X(t)$ is a Markov process and draw and label the state-transition
diagram.
(b) Find the steady-state probability that $X(t) = +1$, that is, $P_X(1; \infty)$, in terms
of the rate parameter $\lambda$.
(c) Write the differential equations for the state probabilities $P_X(1; t)$ and
$P_X(-1; t)$.

*9.13 A uniform Poisson process $N(t)$ with rate $\lambda\,(> 0)$ is an infinite-state Markov chain
with the state-transition diagram in Figure P9.13a. Here the state labels are the
values of the process (chain) $N(t)$ between the transitions. Also the independent
interarrival times $\tau[n]$ are exponentially distributed with parameter $\lambda$.

Figure P9.13a Poisson process represented as Markov chain.

We make the following modifications to the above scenario. Replace the independent
interarrival times $\tau[n]$ by an arbitrary nonnegative, stationary, and independent
random sequence, still denoted $\tau[n]$, resulting in the generalization called a renewal
process in the literature. See Figure P9.13b.

Figure P9.13b More general (renewal) process chain.

(a) Show that the PMF $P_N(n; t) = P[N(t) = n]$ of a renewal process is given, in
terms of the CDF of the arrival times $F_T(t; n)$, as

$$P_N(n; t) = F_T(t; n) - F_T(t; n+1), \quad \text{when } n \ge 1,$$

where the arrival time $T[n] = \sum_{k=1}^{n} \tau[k]$ and $F_T(t; n)$ is the corresponding
CDF of the arrival time $T[n]$.
(b) Let $\tau[n]$ be $U[0, 1]$, that is, uniformly distributed over $[0, 1]$, and find $P_N(n; t)$
for $n = 0$, 1, and 2, for this specific renewal process.
(c) Find the characteristic function of the renewal process of part (b).
(d) Find an approximate expression for the CDF $F_T(t; n)$ of the renewal process
in part (b), that is good for large $n$, and not too far from the $T[n]$ mean
value. (Hint: For small $x$ we have the trigonometric series approximation
$\sin x \approx x - x^3/3!$.)

9.14 If $X(t)$ with $X(0) = 0$ and $\mu = 0$ is a Wiener process, show that $Y(t) = \sigma X(t/\sigma^2)$
is also a Wiener process. Find its covariance function.
9.15 Let $W_1(t)$ and $W_2(t)$ be two Wiener processes, independent of one another, both
defined on $t \ge 0$, with variance parameters $\alpha_1$ and $\alpha_2$, respectively. Let the process
$X(t)$ be defined as their algebraic difference, that is, $X(t) \triangleq W_1(t) - W_2(t)$.

(a) What is $R_{XX}(t_1, t_2)$ for $t_1, t_2 \ge 0$?
(b) What is the pdf $f_X(x; t)$ for $t \ge 0$?
9.16 If the $2n$ random variables $A_r$ and $B_r$ are uncorrelated with zero mean and
$E(A_r^2) = E(B_r^2) = \sigma_r^2$, show that the random process

$$X(t) = \sum_{r=1}^{n} (A_r \cos \omega_r t + B_r \sin \omega_r t)$$

is wide-sense stationary. What are the mean and autocorrelation of $X(t)$?
9.17 Let $W(t)$ be a standard Wiener process, that is, $\alpha = 1$, and define

$$X(t) \triangleq W^2(t) \quad \text{for } t \ge 0.$$

(a) Find the probability density $f_X(x; t)$.
(b) Find the conditional probability density $f_X(x_2 \mid x_1; t_2, t_1)$, $t_2 > t_1$.
(c) Is $X(t)$ Markov? Why?

(d) Does $X(t)$ have independent increments? Justify.

9.18 Let $X(t)$ be a Markov random process on $[0, \infty)$ with initial density
$f_X(x; 0) = \delta(x - 1)$ and conditional pdf

$$f_X(x_2 \mid x_1; t_2, t_1) = \frac{1}{\sqrt{2\pi(t_2 - t_1)}} \exp\left(-\frac{1}{2}\,\frac{(x_2 - x_1)^2}{t_2 - t_1}\right), \quad \text{for all } t_2 > t_1.$$

(a) Find $f_X(x; t)$ for all $t$.
(b) Repeat part (a) for $f_X(x; 0) \sim N(0, 1)$.

9.19 Consider the three-state Markov chain $N(t)$ with the state-transition diagram shown
in Figure P9.19. Here the state labels are the actual outputs, e.g., $N(t) = 3$ while
the chain is in state 3. The state transitions are governed by jointly independent,
exponentially distributed interarrival times, with average rates as indicated on the
branches.

(a) Given that we start in state 2 at time $t = 0$, what is the probability (conditional
probability) that we remain in this state until time $t$, for some arbitrary
$t > 0$? (Hint: There are two ways to leave state 2. So you will leave at the
lesser of the two independent exponential random variables with rates $\mu_2$
and $\lambda_2$.)
(b) Write the differential equations for the probability of being in state $i$ at time
$t \ge 0$, denoting them as $p_i(t)$, $i = 1, 2, 3$. [Hint: First write $p_i(t + \delta t)$ in terms
of the $p_i(t)$, $i = 1, 2, 3$, only keeping terms up to order $O(\delta t)$.]
(c) Find the steady-state solution for $p_i(t)$ for $i = 1, 2, 3$, that is, $p_i(\infty)$.

[State-transition diagram: states 1, 2, 3 with branch rates $\lambda_1$, $\lambda_2$, $\mu_2$, $\mu_3$]

Figure P9.19 A three-state continuous-time Markov chain.


9.20 Let a certain wireless communication binary channel be in a good state or bad state,
described by the continuous-time Markov chain with transition rates as shown in
Figure P9.20. Here we are given that the exponentially distributed state transitions
have rates $\lambda_1 = 1$ and $\lambda_2 = 9$. The value of $\epsilon$ for each state is given in part (b)
below.

(a) Find the steady-state probability that the channel is in the good state. Label
$P\{X(t) = \text{good}\} = p_G$ and $P\{X(t) = \text{bad}\} = p_B$. (Hint: Assume the
steady state exists and then write $p_B$ at time $t$ in terms of the two possibilities
at time $t - \delta$, keeping only terms to first order in $\delta$, taken as very small.)
(b) Assume that in the good state, there are no errors on the binary channel, but
in the bad state the probability of error is $\epsilon = 0.01$. Find the average error
probability on the channel. (Assume that the channel does not change state
during the transmission of each single bit.)

Figure P9.20 Model of two-state wireless communication channel.


9.21 This problem concerns the Chapman-Kolmogorov equation (cf. Equation 9.2-22) for
a continuous-amplitude Markov random process $X(t)$,

$$f_X(x(t_3) \mid x(t_1)) = \int_{-\infty}^{+\infty} f_X(x(t_3) \mid x(t_2))\, f_X(x(t_2) \mid x(t_1))\, dx(t_2),$$

for the conditional pdf at three increasing observation times $t_3 > t_2 > t_1 > 0$. You
will show that the pdf of the Wiener process with covariance function
$K_{XX}(t, s) = \alpha \min(t, s)$, $\alpha > 0$, solves the above equation.

(a) Write the first-order pdf $f_X(x(t))$ of this Wiener process for $t > 0$.
(b) Write the first-order conditional pdf $f_X(x(t) \mid x(s))$, $t > s > 0$.
(c) Referring back to the Chapman-Kolmogorov equation, set $t_3 - t_2 = t_2 - t_1 = \delta$
and use $x_3$, $x_2$, and $x_1$ to denote the values taken on. Then verify that your
conditional pdf from part (b) satisfies the resulting equation

$$f_X(x_3 \mid x_1) = \int_{-\infty}^{+\infty} f_X(x_3 \mid x_2)\, f_X(x_2 \mid x_1)\, dx_2.$$

9.22 Is the random process $X'(t)$ of Example 9.3-2 stationary? Why?
9.23 Let $A$ and $B$ be i.i.d. random variables with mean 0, variance $\sigma^2$, and third moment
$m_3 \triangleq E[A^3] = E[B^3] \ne 0$. Consider the random process

$$X(t) = A\cos(2\pi f t) + B\sin(2\pi f t), \quad -\infty < t < +\infty,$$

where $f$ is a given frequency.

(a) Show that the random process $X(t)$ is WSS.
(b) Show that $X(t)$ is not strictly stationary.


9.24 Verify whether the sine-wave process $\{X(t)\}$, where $X(t) = Y \cos \omega t$, $\omega$ is a
constant, and $Y$ is uniformly distributed in $(0, 1)$, is a strict-sense stationary process.

9.25 If $X(t) = \mu + N(t)$, where $E[X(t)] = \mu$ and $N(t)$ is a white noise with autocovariance
function $K(t_1, t_2) = Q(t_1)\delta(t_1 - t_2)$, where $Q(t)$ is a bounded function of $t$ and $\delta$ is
the unit impulse function, prove that $\{X(t)\}$ is a mean-ergodic process.

9.26 If $X(t)$ is a wide-sense stationary process with autocorrelation function
$R_{XX}(\tau) = A e^{-\alpha|\tau|}$, determine the second-order moment of the random variable $X(8) - X(5)$.

9.27 The power spectrum of a wide-sense stationary process $\{X(t)\}$ is given by
$S_{XX}(\omega) = 1/(1 + \omega^2)^2$. Find its autocorrelation function $R_{XX}(\tau)$ and average power.

9.28 Consider the LSI system shown in Figure P9.28, whose input is the zero-mean
random process $W(t)$ and whose output is the random process $X(t)$. The frequency
response of the system is $H(\omega)$. Given $K_{WW}(\tau) = \delta(\tau)$, find $H(\omega)$ in terms of the
cross-covariance $K_{XW}(\tau)$ or its Fourier transform.

[Block diagram: input $W(t)$, system $H(\omega)$, output $X(t)$]

Figure P9.28 LSI system with white noise input.

9.29 If the input $x(t)$ and the output $y(t)$ are connected by the differential equation

$$T\,\frac{dy(t)}{dt} + y(t) = x(t),$$

prove that they can be related by means of a convolution type integral. Assume
that $x(t)$ and $y(t)$ are zero for $t \le 0$.

9.30 Consider the first-order stochastic differential equation

$$\frac{dX(t)}{dt} + X(t) = W(t)$$

driven by the zero-mean white noise $W(t)$ with correlation function
$R_{WW}(t, s) = \delta(t - s)$.

(a) If this differential equation is valid for all time, $-\infty < t < +\infty$, find the psd
of the resulting wide-sense stationary process $X(t)$.
(b) Using residue theory (or any other method), find the inverse Fourier transform
of $S_{XX}(\omega)$, the autocorrelation function $R_{XX}(\tau)$, $-\infty < \tau < +\infty$.
(c) If the above differential equation is run only for $t > 0$, is it possible to
choose an initial condition random variable $X(0)$ such that $X(t)$ is wide-sense
stationary for all $t > 0$? If such a random variable exists, find its mean
and variance. You may assume that the random variable $X(0)$ is orthogonal
to $W(t)$ on $t \ge 0$; that is, $X(0) \perp W(t)$. [Hint: Express $X(t)$ for $t > 0$ in
terms of the initial condition and a stochastic integral involving $W(t)$.]

9.31 If the random process $\{X(t)\}$ is defined as $X(t) = Y(t)Z(t)$, where $Y(t)$ and $Z(t)$ are
independent wide-sense stationary processes, determine the power spectral density
of $X(t)$.

9.32 Consider a wide-sense stationary process $X(t)$ with autocorrelation function $R_{XX}(\tau)$
and power spectral density function $S_{XX}(\omega)$. Let $X'(t) = \dfrac{dX(t)}{dt}$. Show that

(a) $R_{XX'}(\tau) = \dfrac{d}{d\tau} R_{XX}(\tau)$
(b) $R_{X'X'}(\tau) = -\dfrac{d^2}{d\tau^2} R_{XX}(\tau)$
(c) $S_{X'X'}(\omega) = \omega^2 S_{XX}(\omega)$

9.33 If $X(t)$ is the input voltage to a circuit (system) and $Y(t)$ is the output voltage,
where $X(t)$ is a stationary random process with mean $\mu_X = 0$ and autocorrelation
function $R_{XX}(\tau) = e^{-\alpha|\tau|}$, and if the power transfer function is
$H(\omega) = \dfrac{R}{R + jL\omega}$, find the mean $\mu_Y$, the autocorrelation function $R_{YY}(\tau)$,
and the power spectral density $S_{YY}(\omega)$ of $Y(t)$.

9.34 A WSS and zero-mean random process $Y(t)$ has sample functions consisting of
successive rectangular pulses of random amplitude and duration as shown in Figure
P9.34.

Figure P9.34 Random amplitude pulse train.

The pdf for the pulse width is

$$f_W(w) = \begin{cases} \lambda e^{-\lambda w}, & w \ge 0, \\ 0, & w < 0, \end{cases}$$

with $\lambda > 0$. The amplitude of each pulse is a random variable $X$ (independent
of $W$) with mean 0 and variance $\sigma_X^2$. Successive amplitudes and pulse widths are
independent.

(a) Find the autocorrelation function $R_{YY}(\tau) = E[Y(t + \tau)Y(t)]$.
(b) Find the corresponding psd $S_{YY}(\omega)$.

[Hint: First find the conditional autocorrelation function $E[Y(t + \tau)Y(t) \mid W = w]$,
where $t$ is assumed to be at the start of a pulse (do this without loss of generality
per WSS hypothesis for $Y(t)$).]

9.35 The power spectral density of a zero-mean wide-sense stationary process $\{X(t)\}$ is
given by

$$S_{XX}(\omega) = \begin{cases} 1, & |\omega| < \omega_0 \\ 0, & \text{elsewhere.} \end{cases}$$

Determine the autocorrelation function of $\{X(t)\}$ and show that $X(t)$ and
$X(t + \pi/\omega_0)$ are uncorrelated.


*9.36 In this problem we consider using white noise as an approximation to a smoother
process (cf. More on White Noise in Section 9.5), which is the input to a lowpass
filter. The output process from the filter is then investigated to determine the error
resulting from the white noise approximation. Let the stationary random process
$X(t)$ have zero mean and autocovariance function

$$K_{XX}(\tau) = \frac{1}{2\tau_0} \exp(-|\tau|/\tau_0),$$

which can be written as $h(\tau) * h(-\tau)$ with $h(\tau) = \frac{1}{\tau_0} e^{-\tau/\tau_0} u(\tau)$.

[Block diagram: input $X(t)$, filter $G(\omega)$, output $Y(t)$]

Figure P9.36a Approximation to white noise input to filter.

(a) Let $X(t)$ be input to the lowpass filter shown in Figure P9.36a, with output
$Y(t)$. Find the output psd $S_{YY}(\omega)$, for

$$G(\omega) \triangleq \begin{cases} 1, & |\omega| \le \omega_0 \\ 0, & \text{else.} \end{cases}$$

[Block diagram: input $W(t)$, filter $G(\omega)$, output $V(t)$]

Figure P9.36b White noise input to filter.

(b) Alternatively we may, at least formally, excite the system directly with a
standard white noise $W(t)$, with mean zero and $K_{WW}(\tau) = \delta(\tau)$. Call the
output $V(t)$ as shown in Figure P9.36b. Find the output psd $S_{VV}(\omega)$.
(c) Show that for $|\omega_0 \tau_0| \ll 1$, $S_{YY} \approx S_{VV}$ and find an upper bound on the
power error

$$|R_{VV}(0) - R_{YY}(0)|.$$


9.37 Consider the LSI system shown in Figure P9.37. Let $X(t)$ and $N(t)$ be WSS and
mutually uncorrelated with power spectral densities $S_{XX}(\omega)$ and $S_{NN}(\omega)$ and zero
means.

Figure P9.37

(a) Find the psd of the output $Y(t)$.
(b) Find the cross-power spectral density of $X$ and $Y$, that is, find $S_{XY}(\omega)$ and
$S_{YX}(\omega)$.
(c) Define the error $\varepsilon(t) \triangleq Y(t) - X(t)$ and evaluate the psd of $\varepsilon(t)$.
(d) Assume that $h(t) = a\delta(t)$ and choose the value of $a$ which minimizes
$E[\varepsilon^2(t)] = R_{\varepsilon\varepsilon}(0)$.
9.38 Let $X(t)$ be a random process defined by

$$X(t) \triangleq N \cos(2\pi f_0 t + \Theta),$$

where $f_0$ is a known frequency and $N$ and $\Theta$ are independent random variables. The
CF for $N$ is

$$\Phi_N(\omega) = E[e^{j\omega N}] = \exp\{\lambda[e^{j\omega} - 1]\},$$

where $\lambda$ is a given positive constant (i.e., $N$ is a Poisson random variable). The
random variable $\Theta$ is uniformly distributed on $[-\pi, +\pi]$.

(a) Determine the mean function $\mu_X(t)$.
(b) Determine the covariance function $K_{XX}(t, s)$.
(c) Is $X(t)$ WSS? Justify your answer.
(d) Is $X(t)$ stationary? Justify your answer.
9.39 Let $X(t)$ be an independent-increment random process defined on $t \ge 0$ with initial
value $X(0) = X_0$, a random variable. Assume the following CFs exist:
$E[e^{j\omega X_0}] \triangleq \Phi_{X_0}(\omega)$ and

$$E[e^{j\omega(X(t) - X(s))}] \triangleq \Phi_{X(t) - X(s)}(\omega) \quad \text{for } t \ge s.$$

(a) On defining $E[e^{j\omega X(t)}] \triangleq \Phi_{X(t)}(\omega)$, show that

$$\Phi_{X(t)}(\omega) = \Phi_{X_0}(\omega)\,\Phi_{X(t) - X_0}(\omega).$$

(b) Show that for all $t_2 \ge t_1$, the joint characteristic function of $X(t_2)$ and $X(t_1)$
is given by

$$\Phi_{X(t_2), X(t_1)}(\omega_2, \omega_1) = \Phi_{X_0}(\omega_1 + \omega_2)\,\Phi_{X(t_1) - X_0}(\omega_1 + \omega_2)\,\Phi_{X(t_2) - X(t_1)}(\omega_2).$$

(c) Apply part (a) to Problem 9.18(b) by using Gaussian characteristic functions.

9.40 Given a random variable $Y$ with characteristic function $\phi(\omega) = E[e^{j\omega Y}]$ and a random
process $X(t) = \cos(\lambda t + Y)$, where $\lambda$ is a constant, show that $\{X(t)\}$ is stationary
in the wide sense if $\phi(1) = \phi(2) = 0$.

9.41 Let $X(t)$ defined over $t \ge 0$ have independent increments with mean function
$\mu_X(t) = \mu_0$ and covariance function

$$K_{XX}(t_1, t_2) = \sigma_X^2(\min(t_1, t_2)),$$

where $\sigma_X^2(t)$ is an increasing function, that is, $d\sigma_X^2(t)/dt > 0$ for all $t \ge 0$, called the
variance function. Note that $\mathrm{Var}[X(t)] = \sigma_X^2(t)$. Fix $T > 0$ and find the mean and
covariance functions of $Y(t) \triangleq X(t) - X(T)$ for all $t \ge T$. (Note: For the covariance
function take $t_1$ and $t_2 \ge T$.)

*9.42 Following Example 9.2-3, use MATLAB to compute a 1000-element sample function
of the Wiener process $X(t)$ for $\alpha = 2$ and $T = 0.01$.

(a) Use the MATLAB routine hist.m to compute the histogram of $X(10)$ and
compare it with the ideal Gaussian pdf.
(b) Estimate the mean of $X(10)$ using mean.m and the standard deviation using
std.m and compare them to theoretical values. [Hint: Use Wiener.m† in a
for loop to calculate 100 realizations of x(1000). Then use hist. Question:
Why can't you just use the last 100 elements of the vector x to approximately
obtain the requested statistics?]

9.43 Let the WSS random process $X(t)$ be the input to the third-order differential
equation

$$\frac{d^3Y}{dt^3} + a_2 \frac{d^2Y}{dt^2} + a_1 \frac{dY}{dt} + a_0 Y(t) = X(t),$$

with WSS output random process $Y(t)$.

(a) Put this equation into the form of a first-order vector differential equation

$$\frac{d\mathbf{Y}}{dt} = \mathbf{A}\mathbf{Y}(t) + \mathbf{B}\mathbf{X}(t),$$

by defining $\mathbf{Y}(t) \triangleq [Y(t),\ Y'(t),\ Y''(t)]^T$ and $\mathbf{X}(t) \triangleq [X(t)]$ and evaluating the
matrices $\mathbf{A}$ and $\mathbf{B}$.
(b) Find a first-order matrix-differential equation for $\mathbf{R}_{XY}(\tau)$ with input
$\mathbf{R}_{XX}(\tau)$.
(c) Find a first-order matrix-differential equation for $\mathbf{R}_{YY}(\tau)$ with input
$\mathbf{R}_{XY}(\tau)$.
(d) Using matrix Fourier transforms, show that the output psd matrix $\mathbf{S}_{YY}$ is
given as

$$\mathbf{S}_{YY}(\omega) = (j\omega\mathbf{I} - \mathbf{A})^{-1}\,\mathbf{B}\,\mathbf{S}_{XX}(\omega)\,\mathbf{B}^{\dagger}\,(-j\omega\mathbf{I} - \mathbf{A}^{\dagger})^{-1}.$$

†Wiener.m is provided on this book's Web site.
9.44 Let $\mathbf{X}(t)$ be a WSS vector random process, which is input to the LSI system with
impulse response matrix $\mathbf{h}(t)$.

(a) Show that the correlation matrix of the output $\mathbf{Y}(t)$ is given by
Equation 9.7-4.
(b) Derive the corresponding equation for matrix covariance functions.

9.45 In geophysical signal processing one often has to simulate a multichannel random
process. The following problem brings out an important constraint on the power
spectral density matrix of such a vector random process. Let the $N$-dimensional
vector random process $\mathbf{X}(t)$ be WSS with correlation matrix

$$\mathbf{R}_{XX}(\tau) \triangleq E[\mathbf{X}(t + \tau)\mathbf{X}^{\dagger}(t)]$$

and power spectral density matrix

$$\mathbf{S}_{XX}(\omega) \triangleq FT\{\mathbf{R}_{XX}(\tau)\}.$$

Here $FT\{\cdot\}$ denotes the matrix Fourier transform, that is, the $(i, j)$th component of
$\mathbf{S}_{XX}$ is the Fourier transform of the $(i, j)$th component of $\mathbf{R}_{XX}$, which is
$E[X_i(t + \tau)X_j^*(t)]$, where $X_i(t)$ is the $i$th component of $\mathbf{X}(t)$.

(a) For constants $a_1, \ldots, a_N$ define the WSS scalar process

$$Y(t) \triangleq \sum_{i=1}^{N} a_i X_i(t).$$

Find the power spectral density of $Y(t)$ in terms of the components of the
matrix $\mathbf{S}_{XX}(\omega)$.
(b) Show that the psd matrix $\mathbf{S}_{XX}(\omega)$ must be a positive semidefinite matrix for
each fixed $\omega$; that is, we must have $\mathbf{a}^T \mathbf{S}_{XX}(\omega)\mathbf{a}^* \ge 0$ for all complex column
vectors $\mathbf{a}$.


9.46 Consider the linear system shown in Figure P9.46 excited by the two orthogonal,
zero-mean, jointly WSS random processes $X(t)$, "the signal," and $U(t)$, "the noise."
Then the input to the system $G$ is

$$Y(t) = h(t) * X(t) + U(t),$$

which models a distorted-signal-in-noise estimation problem. If we pass this $Y(t)$,
"the received signal," through the filter $G$, we get an estimate $\hat{X}(t)$. Finally, $\varepsilon(t)$ can
be thought of as the "estimation error"

$$\varepsilon(t) = \hat{X}(t) - X(t).$$

Figure P9.46 System for evaluating estimation error.

In this problem we will calculate some relevant power spectral densities and cross-power
spectral density.

(a) Find $S_{YY}(\omega)$.
(b) Find $S_{X\hat{X}}(\omega) = S^*_{\hat{X}X}(\omega)$, in terms of $H$, $G$, $S_{XX}$, and $S_{UU}$.
(c) Find $S_{\varepsilon\varepsilon}(\omega)$.
(d) Use your answer to part (c) to show that to minimize $S_{\varepsilon\varepsilon}(\omega)$ at those
frequencies where

$$S_{XX}(\omega) \gg S_{UU}(\omega),$$

we should have $G \approx H^{-1}$, and where

$$S_{XX}(\omega) \ll S_{UU}(\omega)$$

we should have $G \approx 0$.


*9.47 Let $X(t)$, the input to the system in Figure P9.47, be a stationary Gaussian random
process. The power spectral density of $Z(t)$ is measured experimentally and found
to be

$$S_{ZZ}(\omega) = \pi\delta(\omega) + \frac{2\beta}{(\omega^2 + \beta^2)(\omega^2 + 1)}.$$

[Block diagram: input $X(t)$, squarer $(\cdot)^2$ with output $X^2(t) = Y(t)$, linear filter $h(t) = e^{-t}u(t)$, output $Z(t)$]

Figure P9.47 Squarer nonlinearity followed by linear filter.

(a) Find the correlation function of $Y(t)$ in terms of $\beta$.
(b) Find the correlation function of $X(t)$.

9.48 Consider the two-state Markov chain $N(t)$ shown in Figure P9.48, taking on values
1 and 2. While in state 1, the transition time to state 2 has average rate $\lambda_1 = 1$.
In state 2, the transition time to state 1 has average rate $\lambda_2 = 2$. Denote the state
probabilities as $P_1(t)$ and $P_2(t)$, where $P_i(t) \triangleq P[N(t) = i]$ for $i = 1, 2$.

Figure P9.48 Two-state Markov chain state-transition diagram.

(a) Derive the differential equations for the $P_i(t)$.
(b) Find their steady-state solution.


9.49 The Schwarz inequality for complex-valued random variables states that

$$|E[XY^*]| \le \sqrt{E[|X|^2]\,E[|Y|^2]},$$

for two random variables $X$ and $Y$.

(a) Use the Schwarz inequality to derive the corresponding result for WSS random
processes $X(t)$ and $Y(t)$,

$$|R_{XY}(\tau)| \le \sqrt{R_{XX}(0)\,R_{YY}(0)}.$$

(b) Find the corresponding result for cross-power spectral densities,

$$|S_{XY}(\omega)| \le \sqrt{S_{XX}(\omega)\,S_{YY}(\omega)}.$$

Hint: Interpret the result of part (a) in terms of cross- and auto-power
spectra, and then introduce a narrow bandpass filter centered at an arbitrary
frequency $\omega$.


9.50 The Wiener process, also called Brownian motion, is the integral of white noise.
Letting $B(t)$ denote the Wiener process, with $W(t)$ denoting the white noise, we can
write

$$B(t) = \int_0^t W(\tau)\,d\tau, \quad t \ge 0.$$

Take $W(t)$ to be a standard white noise with correlation function $R_{WW}(\tau) = \delta(\tau)$.

(a) Find and sketch the cross-correlation function $R_{BW}(t_1, t_2)$.
(b) Find and sketch the autocorrelation function $R_{BB}(t_1, t_2)$.
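
A discrete-time stand-in for the integral in Problem 9.50 can help when sketching the correlation functions. The Python sketch below is one possible approximation, not the book's Wiener.m: it assumes Gaussian increments, replaces the integral of standard white noise over each step of length `dt` by an independent zero-mean Gaussian variable of variance `dt`, and forms a running sum. The function names, step size, and number of paths are hypothetical.

```python
import math
import random


def wiener_path(n_steps, dt, rng):
    """Approximate B(k*dt), k = 0..n_steps, by a running sum of increments.

    Each increment stands in for the integral of standard white noise over
    an interval of length dt, modeled as an independent N(0, dt) variable.
    """
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


rng = random.Random(2)
n_steps, dt = 100, 0.01  # grid covering 0 <= t <= 1
paths = [wiener_path(n_steps, dt, rng) for _ in range(5000)]


def sample_correlation(i, j):
    """Monte Carlo estimate of E[B(t_i) B(t_j)] with t_k = k * dt."""
    return sum(p[i] * p[j] for p in paths) / len(paths)


print(sample_correlation(50, 100), sample_correlation(100, 100))
```

Evaluating `sample_correlation` over a grid of index pairs gives an empirical surface that can be compared with the sketch requested in part (b).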


9.51 Consider the two-processor reliability problem of Example 9.2-4 in the text, a
three-state continuous-time Markov random process $X(t)$ with state-transition diagram
shown in Figure P9.51. Here, $X(t)$ denotes the number of processors "up" at time $t$.

Figure P9.51

(a) Write the state probability vector $\mathbf{p}(t)$ differential equation

$$d\mathbf{p}(t)/dt = \mathbf{A}\mathbf{p}(t)$$

and explicitly find the generator matrix $\mathbf{A}$.
(b) We determine the steady-state probability vector $\mathbf{p}$ by solving the homogeneous
matrix-vector equation $\mathbf{A}\mathbf{p} = \mathbf{0}$, subject to the constraint that all the
probabilities in the probability vector $\mathbf{p}$ sum to 1. Someone claims that the
"probability flows" across the dashed vertical lines in Figure P9.51 must balance in
the steady state, that is, $2\mu p_0 = \lambda p_1$ and $\mu p_1 = 2\lambda p_2$, where the $p_i$ are the
elements of the vector $\mathbf{p}$, that is, the steady-state probabilities of being in state
$i$, $i = 0, 1, 2$. State why this is a reasonable assertion, and prove it by showing
that the resulting equations satisfy $\mathbf{A}\mathbf{p} = \mathbf{0}$.
(c) Solve for the numerical steady-state probability values in the case when $\lambda = 0.001$
and $\mu = 0.1$ per hour.
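
The procedure described in part (b), solving $\mathbf{A}\mathbf{p} = \mathbf{0}$ subject to the probabilities summing to 1, can be carried out numerically once a generator matrix is in hand. The following is a minimal Python sketch on a hypothetical two-state chain; the rates `a` and `b`, the function name, and the convention that column $j$ of the matrix holds the rates out of state $j$ (so that each column sums to zero) are assumptions for illustration and are not the values of Figure P9.51.

```python
def steady_state(A):
    """Solve A p = 0 with sum(p) = 1 for a generator matrix A (list of rows).

    One balance equation is redundant, so the last row is replaced by the
    normalization constraint; the system is then solved by Gauss-Jordan
    elimination with partial pivoting.
    """
    n = len(A)
    M = [list(row) + [0.0] for row in A]  # augmented matrix [A | 0]
    M[-1] = [1.0] * n + [1.0]             # last row: p_0 + ... + p_{n-1} = 1
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


a, b = 0.5, 1.5  # hypothetical rates: state 0 -> 1 at rate a, state 1 -> 0 at rate b
A = [[-a, b],
     [a, -b]]
p = steady_state(A)
residual = [sum(A[i][j] * p[j] for j in range(2)) for i in range(2)]
print(p, residual)
```

The `residual` list evaluates $\mathbf{A}\mathbf{p}$ at the computed solution and should be numerically zero; the same function accepts a three-state generator matrix once its entries have been worked out by hand.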


9.52 Consider the three-input, two-output LSI system shown in Figure P9.52. The input
random processes $X_1(t)$, $X_2(t)$, and $U(t)$ are jointly WSS and pairwise orthogonal,
that is, $X_1 \perp X_2$, $X_1 \perp U$, and $X_2 \perp U$. We are given the following functions:
the indicated system functions $H$, $G$, and $B$, plus the three input power spectral
densities $S_{X_1X_1}$, $S_{X_2X_2}$, and $S_{UU}$. You may express your answers in terms of these
functions.

Figure P9.52

(a) Find the input/output cross-power spectral density Sy, x, (w).
(b) Find the input/output cross-power spectral density Sy, x, (w). 
(c) Find the output cross-power spectral density Sy, y, (w). 


9.53 Consider the following tapped delay-line problem. We have a random sequence $A_n$
for the taps and a WSS random process $X(t)$ as the signal model. Assume the
total number of taps is $N$ and the tap spacing is $T$. Assume also that the random
sequence $A_n$ and the random process $X(t)$ are jointly independent. The tapped
delay-line output is therefore

$$Y(t) = \sum_{n=0}^{N-1} A_n X(t - nT).$$

The correlation function for the random sequence of tap weights is given as $R_{AA}(n_1, n_2)$,
and the correlation function of the WSS random process is given as $R_{XX}(\tau)$.

(a) Find the output correlation function $R_{YY}(t_1, t_2)$ in terms of the given functions
and parameters.
(b) Does the wide-sense stationarity of $Y(t)$ depend on whether the random
sequence $A_n$ is WSS? Justify your answer.
(c) In finding your result in part (a), is it sufficient that $A_n$ and $X(t)$ be
uncorrelated? Why?


9.54 Let a certain wireless packet channel (Gilbert channel model) having a good state 
and a bad state be modeled as a continuous-time, two-state Markov chain with 
transition rates as given in Figure P9.54. 


Figure P9.54 Gilbert channel model. 


(a) Find the steady-state probability of being in the bad state. 
(b) In the good state, all packets are received. In the bad state, all packets are
lost. This leads to bursts or clusters of lost packets. In a packet-loss burst, all
packets are lost. What is the average length of a packet-loss burst? Justify.
Note that the chain is in the bad state for the full duration of a packet-loss
burst.

9.55 Consider a Poisson random process $N(t)$ with average arrival rate $\lambda = 3$.

(a) Find the probability that $N(4) = 2$.
(b) Find the joint probability that $N(1) = 1$ and $N(2) = 2$.
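
Probabilities of this kind can be sanity-checked by simulation. The Python sketch below is illustrative only: it builds a Poisson process from i.i.d. exponential interarrival times and compares a relative frequency with the Poisson PMF. The rate, time, and count used here are hypothetical values chosen so that the event is common enough to estimate, not the values of Problem 9.55, and the function names are assumptions.

```python
import math
import random


def count_arrivals(rate, t, rng):
    """Number of arrivals in (0, t] when interarrival times are i.i.d. exponential."""
    n, clock = 0, rng.expovariate(rate)
    while clock <= t:
        n += 1
        clock += rng.expovariate(rate)
    return n


def poisson_pmf(n, mean):
    """Poisson PMF with the given mean, evaluated at the integer n >= 0."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


rng = random.Random(3)
rate, t, n = 1.0, 2.0, 2  # hypothetical values for illustration
trials = 20000
freq = sum(count_arrivals(rate, t, rng) == n for _ in range(trials)) / trials
exact = poisson_pmf(n, rate * t)
print(f"relative frequency {freq:.4f} vs. PMF {exact:.4f}")
```

For rare events, such as a count far below the mean, the number of trials would need to grow considerably before the relative frequency becomes informative, so the closed-form PMF is the better tool there.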

9.56 Consider the system shown in Figure P9.56.

[Block diagram: $X[n]$ and $V[n]$ enter a summer $(+)$, followed by the system $H(\omega)$ with output $Y[n]$]

Figure P9.56

Let $X[n]$ and $V[n]$ be WSS and mutually uncorrelated with zero means and power
spectral densities $S_{XX}(\omega)$ and $S_{VV}(\omega)$, respectively.

(a) Find the psd of the output $Y[n]$.
(b) Find the cross-power spectral density between input $X[n]$ and output $Y[n]$,
that is, $S_{XY}(\omega)$.


9.57 Let $X(t)$ and $Y(t)$ be two zero-mean random processes with known correlation
coefficient function

$$\rho_{XY}(t_1, t_2) \triangleq \frac{E[X(t_1)Y^*(t_2)]}{\sqrt{E[|X(t_1)|^2]}\,\sqrt{E[|Y(t_2)|^2]}}$$

and assume that the average powers $E[|X(t)|^2] = E[|Y(t)|^2] \triangleq P$, a constant. Next,
add two random noises $U(t)$ and $V(t)$, jointly orthogonal to $X(t)$ and $Y(t)$,

$$\tilde{X}(t) \triangleq X(t) + U(t),$$
$$\tilde{Y}(t) \triangleq Y(t) + V(t),$$

where $U$ and $V$ are also orthogonal to each other and of zero mean, and with average
powers $E[|U(t)|^2] = E[|V(t)|^2] \triangleq \epsilon$, a constant. Find the correlation coefficient
function of the tilde processes, that is, $\rho_{\tilde{X}\tilde{Y}}(t_1, t_2)$, in terms of that of the original
processes $X$ and $Y$.

9.58 Consider the system shown in Figure P9.58. The two-input random sequences are 
WSS and given in terms of their power spectral densities: 


2w? + 8 


Sxx() = FFS) 


1 
Figure P9.58 System with signal plus noise input.


The system function $H(\omega)$ is given as $10[u(\omega + \pi/2) - u(\omega - \pi/2)]$ over the interval
$[-\pi, +\pi]$, where the function u is the unit step. Assume that X and V are zero
mean.


(a) Assuming X and V are uncorrelated, find the psd of the output random
sequence Y[n].
(b) Let the cross-power spectral density of X and V be specified as 


1 

Sxvo) = ayp 
and find the new output power spectral density of Y. 

9.59 Consider the random process $X(t) = \cos(\omega_0 t + \Theta)$, where $\Theta$ is a random variable
uniformly distributed over the interval $[0, 2\pi]$, and $\omega_0$ is a fixed frequency. Find
the first-order pdf $f_X(x;t)$. Is the process stationary of first order? Find the
conditional pdf of $X(t_2)$ given $X(t_1) = x_1$, which we denote by $f_X(x_2|x_1; t_1, t_2)$.
You may assume $t_1 < t_2$.

9.60 Let Z(t) = X(t) + jY(t), where X(t) and Y(t) are jointly WSS and real-valued
random processes. Assume that X(t) and Y(t) are mutually orthogonal with zero-mean
functions. Define a new random process in terms of a modulation to a carrier
frequency $\omega_0$ as $U(t) = \mathrm{Re}\{Z(t)e^{-j\omega_0 t}\}$. Given the relevant correlation functions,
that is, $R_{XX}(\tau)$ and $R_{YY}(\tau)$, find general conditions on them such that U(t) is also
a WSS random process. Show that your conditions work, that is, that the resulting
process U(t) is actually WSS. Some helpful trigonometric identities:

$$\cos(\alpha \pm \beta) = \cos\alpha\cos\beta \mp \sin\alpha\sin\beta$$
$$\sin(\alpha \pm \beta) = \sin\alpha\cos\beta \pm \cos\alpha\sin\beta.$$

9.61 Find the steady-state probabilities of the four-state Markov chain shown in Figure
P9.61. Express your answers in terms of the exponential rates $\lambda_i$ and $\mu_i$. Note the
state labels are conveniently given as 1 through 4.

Figure P9.61


Hint: Remember the probability flow concept from Problem 9.51 (b). 
9.62 Consider the three-state Markov process X(t) with state-transition diagram shown
in Figure P9.62. Here the state labels are the actual outputs, that is, X(t) = 3 all the
while the process is in state 3. The state transitions are governed by jointly independent,
exponentially distributed interarrival times, with average rates as indicated
on the branches.

Figure P9.62 Three-state Markov process (branch labels: $\lambda_1$, $\lambda_2$, $\mu_1$, $\mu_2$).


(a) Given that we start at state 2 at time t = 0, what is the probability that we 
leave this state for the first time at time t, for some arbitrary t > 0? 
(b) Find the vector differential equation for the state probability at time t > 0,

$$\frac{d\mathbf{p}}{dt} = \mathbf{A}\mathbf{p}(t),$$

where $\mathbf{p}(t) = [p_1, p_2, p_3]^T$, expressing the generator matrix A in terms of the
$\lambda_i$ and $\mu_i$.
(c) Show that the solution for t > 0 can be expressed as

$$\mathbf{p}(t) = \exp(\mathbf{A}t)\,\mathbf{p}(0),$$

where p(0) is the initial probability vector and the matrix exp(At) is defined
by the infinite series

$$\exp(\mathbf{A}t) \triangleq \mathbf{I} + \mathbf{A}t + \frac{1}{2!}(\mathbf{A}t)^2 + \frac{1}{3!}(\mathbf{A}t)^3 + \cdots.$$

Do not worry about convergence of this series; it is known to converge
absolutely for all finite t.
APPENDIX A Review of Relevant
Mathematics





This section will review the mathematics needed for the study of probability and random 
processes. We start with a review of basic discrete and continuous mathematical concepts. 


A.1 BASIC MATHEMATICS 


We review the concept of sequence and present several examples. We then look at summation 
of sequences. Next the Z-transform is reviewed. 


Sequences 


A sequence is simply a mapping of a set of integers into the set of real or complex numbers.
Most often the set of integers is the nonnegative integers $\{n \ge 0\}$ or the set of all integers
$\{-\infty < n < +\infty\}$.

An example of a sequence often encountered is the exponential sequence $a^n$ for $\{n \ge 0\}$,
which is plotted in Figure A.1-1 for several values of the real number a. Note that for $|a| > 1$,
the sequence diverges, while for $|a| < 1$, the sequence converges to 0. For a = 1, the sequence
is the constant 1, and for a = −1, the sequence alternates between +1 and −1.

A related and important sequence is the complex exponential $\exp(j\omega n)$. These sequences
are eigenfunctions of linear time-invariant systems, which just means that for such a
system with frequency response function $H(\omega)$, the response to the input $\exp(j\omega n)$ is just
$H(\omega)\exp(j\omega n)$, a scaled version of the input.
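
The eigenfunction property can be checked numerically. The sketch below is only an illustration: the three-tap impulse response, the frequency 0.7, and the helper names `frequency_response` and `lti_output` are assumptions made for the example, not quantities from the text.

```python
import cmath

def frequency_response(h, omega):
    """H(omega) = sum_m h[m] exp(-j*omega*m) for a finite impulse response h[0..M-1]."""
    return sum(hm * cmath.exp(-1j * omega * m) for m, hm in enumerate(h))

def lti_output(h, omega, n):
    """y[n] = sum_m h[m] x[n-m] when the input is x[n] = exp(j*omega*n)."""
    return sum(hm * cmath.exp(1j * omega * (n - m)) for m, hm in enumerate(h))

h = [0.5, 0.3, 0.2]   # hypothetical impulse response
omega = 0.7           # hypothetical frequency in radians per sample
for n in (-2, 0, 5):
    scaled_input = frequency_response(h, omega) * cmath.exp(1j * omega * n)
    # The difference between the output and the scaled input should be at rounding level.
    print(n, abs(lti_output(h, omega, n) - scaled_input))
```

The output at every n equals the input sample multiplied by the same complex constant, which is what "scaled version of the input" means here.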




Figure A.1-1 Plot of exponential sequence for three values of a = 1.05, 1.0, and 0.8. 


Convergence 


A sequence, denoted x[n] or $x_n$, which is defined on the positive integers $n \ge 1$, converges
to a limiting value x if the values x[n] become nearer and nearer to x as n becomes large.
More precisely, we can say that for any given $\varepsilon > 0$, there must exist a value $N_0(\varepsilon)$
such that for all $n > N_0$, we have $|x[n] - x| < \varepsilon$. Note that $N_0$ is allowed to depend
on $\varepsilon$.





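Example A.1-1
Let the sequence $a_n$ be given as

$$a_n = \frac{2^n}{2^n + 3^n},$$

and find the limit as $n \to \infty$. From observation, we see that the limit is a = 0. To complete
the argument, we can then express $N_0(\varepsilon)$ from the equation

$$\frac{2^{N_0}}{2^{N_0} + 3^{N_0}} = \varepsilon, \quad\text{that is,}\quad N_0(\varepsilon) = \frac{\ln(1/\varepsilon - 1)}{\ln(3/2)},$$

where we assume that $0 < \varepsilon < 1$. We note that for any fixed $0 < \varepsilon < 1$, the value $N_0$ is
finite as required.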
Summations


Summations of sequences arise quite often in our work. A common sequence used to illustrate
summation concepts is the geometric sequence $a^n$. The following summation formula can
be readily derived: Take $n_2 \ge n_1$.

$$\sum_{n=n_1}^{n_2} a^n = \frac{a^{n_1} - a^{n_2+1}}{1-a}. \tag{A.1-1}$$

Of course, when a = 1, the summation is just $n_2 - n_1 + 1$. A simple way to see the validity
of Equation A.1-1 is to first define $S \triangleq \sum_{n=n_1}^{n_2} a^n$ and then note that, by the special property
of the geometric sequence,

$$aS = S + a^{n_2+1} - a^{n_1}.$$

Then, by solving for S, we derive Equation A.1-1 when $a \ne 1$.
When $|a| < 1$, the upper limit of summation can be extended to $\infty$ to yield

$$\sum_{n=n_1}^{\infty} a^n = \frac{a^{n_1}}{1-a} \quad\text{for } |a| < 1. \tag{A.1-2}$$

Another useful related summation is:

$$\sum_{n=n_1}^{\infty} n a^n = \frac{n_1 a^{n_1}(1-a) + a^{n_1+1}}{(1-a)^2} \quad\text{for } |a| < 1. \tag{A.1-3}$$

Equations A.1-2 and A.1-3 most often occur with $n_1 = 0$.
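
A quick numerical check of these closed forms is sketched below. The function names are hypothetical, and the expression used for Equation A.1-3 is the one obtained by differentiating Equation A.1-2 with respect to a, which is an assumption about the intended form of that equation.

```python
def geometric_sum(a, n1, n2):
    """Closed form of sum_{n=n1}^{n2} a**n (Equation A.1-1), with the a == 1 case handled."""
    if a == 1:
        return n2 - n1 + 1
    return (a**n1 - a**(n2 + 1)) / (1 - a)

def geometric_tail(a, n1):
    """Closed form of sum_{n=n1}^{infinity} a**n for |a| < 1 (Equation A.1-2)."""
    return a**n1 / (1 - a)

def weighted_tail(a, n1):
    """Closed form of sum_{n=n1}^{infinity} n * a**n for |a| < 1 (assumed form of A.1-3)."""
    return (n1 * a**n1 * (1 - a) + a**(n1 + 1)) / (1 - a) ** 2

a, n1, n2 = 0.8, 2, 10   # hypothetical values chosen for the comparison
print(geometric_sum(a, n1, n2), sum(a**n for n in range(n1, n2 + 1)))
print(geometric_tail(a, n1), sum(a**n for n in range(n1, 2000)))
print(weighted_tail(a, n1), sum(n * a**n for n in range(n1, 2000)))
```

Each printed pair compares a closed form with a brute-force partial sum; truncating the infinite sums at 2000 terms is harmless here because the terms decay geometrically.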


Z-Transform 


This transform is very helpful in solving for various quantities in a linear time-invariant
system and also for the solution of linear constant-coefficient difference equations. The
Z-transform is defined for a deterministic sequence x[n] as follows:

$$X(z) \triangleq \sum_{n=-\infty}^{+\infty} x[n]z^{-n}, \quad\text{for } z \in \mathcal{R}.$$

In this equation, the region $\mathcal{R}$ is called the region of convergence and denotes the set of
complex numbers z for which the transform is defined. This set $\mathcal{R}$ is further specified as
those z for which the relevant sum converges absolutely, that is,

$$\sum_{n=-\infty}^{+\infty} |x[n]|\,|z|^{-n} < \infty.$$

This region $\mathcal{R}$ can be written in general as $\mathcal{R} = \{z : R_- < |z| < R_+\}$, an annular shaped
region. The set $\{z : R_- < |z| < R_+\}$ is to be read as “the set of all points z whose magnitude
(length) is greater than $R_-$ and less than $R_+$.”
Example A.1-2
Let the discrete-time sequence x[n] be given as the exponential

$$x[n] = a^n \exp(j\omega_0 n)u[n],$$

where u[n] denotes the unit step sequence, u[n] = 1 for $n \ge 0$ and u[n] = 0 for n < 0.
Calculating the Z-transform, we get

$$X(z) = \sum_{n=0}^{\infty} a^n \exp(j\omega_0 n)z^{-n} = \sum_{n=0}^{\infty} \left(a e^{j\omega_0} z^{-1}\right)^n = \frac{1}{1 - a e^{j\omega_0} z^{-1}}, \quad\text{for } |z| > |a|, \tag{A.1-4}$$

so that $\mathcal{R} = \{z : |a| < |z|\}$.





The Z-transform is quite useful in discrete-time signal processing because of the 
following fundamental theorem relating convolution and multiplication of the corresponding 
Z-transforms. 


Theorem A.1-1 Consider the convolution of two absolutely summable sequences
x[n] and h[n], which generates a new sequence y[n] as follows:

$$y[n] \triangleq \sum_{m=-\infty}^{+\infty} x[m]h[n-m],$$

which we denote operationally as y = h * x. Then the Z-transform of y[n] is given in terms
of the corresponding Z-transforms of x and h as

$$Y(z) = H(z)X(z) \quad\text{for } z \in \mathcal{R}_h \cap \mathcal{R}_x.$$

Because the two sequences h and x are absolutely summable, their regions of convergence
$\mathcal{R}_h$ and $\mathcal{R}_x$ will both include the unit circle of the z-plane, that is, $\{|z| = 1\}$.
Then the Z-transform Y(z) will exist for $z \in \mathcal{R}_h \cap \mathcal{R}_x$, which is then guaranteed to be
nonempty. ■
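
As an informal check of the theorem for finite-length sequences, the following sketch convolves two short causal sequences and compares the Z-transform of the result with the product of the individual Z-transforms at one test point. The sequences, the test point, and the function names are illustrative assumptions.

```python
import cmath

def convolve(x, h):
    """y[n] = sum_m x[m] h[n-m] for finite causal sequences that start at n = 0."""
    y = [0.0] * (len(x) + len(h) - 1)
    for m, xm in enumerate(x):
        for k, hk in enumerate(h):
            y[m + k] += xm * hk
    return y

def z_transform(seq, z):
    """X(z) = sum_n seq[n] z**(-n) for a finite causal sequence."""
    return sum(v * z ** (-n) for n, v in enumerate(seq))

x = [1.0, -0.5, 0.25]          # hypothetical input sequence
h = [0.5, 0.3, 0.2, 0.1]       # hypothetical impulse response
z = 1.2 * cmath.exp(0.4j)      # arbitrary nonzero test point
y = convolve(x, h)
# The two values should agree up to rounding.
print(z_transform(y, z), z_transform(h, z) * z_transform(x, z))
```

For finite sequences the region of convergence is the whole plane except possibly z = 0, so any nonzero test point can be used.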


After obtaining the Z-transform of a convolution using this result, one can often take 
the inverse Z-transform to get back the output sequence y[n]. There are several ways to do 
this, including expansion of the Z-transform Y(z) in a power series, doing long division in 
the typical case when Y(z) is a ratio of polynomials in z, and the most powerful method, the 
method of residues. This last method, along with the residue method for inverse Laplace 
transforms, is the topic of Section A.3 of this appendix. 


A.2 CONTINUOUS MATHEMATICS 


The intent here is to review some ideas from the integral calculus of one- and two-dimensional 
functions of real variables. 







Definite and Indefinite Integrals 


In a basic calculus course, we study two types of integrals, definite and indefinite: 


$$\int x^2\,dx = \frac{x^3}{3} + C \quad\text{indefinite},$$

$$\int_a^b x^2\,dx = \frac{1}{3}b^3 - \frac{1}{3}a^3 \quad\text{definite}.$$

In this course we will almost always write the definite integral, almost never the indefinite
integral. This is because we will use integrals to measure specific quantities, not merely to
determine the class of functions that have a given derivative. Please note the difference
between these two integrals. Unlike the indefinite integral, the definite integral is a function
of its upper and lower limits, but not of x itself! Sometimes we refer to x in our definite
integrals as a “dummy variable” for this reason, that is, x could just as well be replaced by
another variable, say y, with no change resulting to our definite integral, that is,

$$\int_a^b x^2\,dx = \int_a^b y^2\,dy.$$


To compute the definite integral we first compute the indefinite integral and then subtract 
its evaluation at the lower limit from its evaluation at the upper limit. 

In elementary calculus courses it is often not stressed that definite integrals are operations
on sets and that there are integrals that are not of the “area under a curve” type, that is,
that are not so-called Riemann integrals. Consider the definite integral

$$I = \int_a^b f(x)\,dz(x).$$

Here the set of points is $\{x : a < x < b\}$ and the integral is computed by assigning numerical
values to the points in an n-partition of the interval (a, b), with $\Delta x = (b - a)/n$, for example

$$I_n(a,b) = \sum_i f(i\Delta x)\left[z(i\Delta x + \Delta x/2) - z(i\Delta x - \Delta x/2)\right],$$

where $i\Delta x,\ i\Delta x + \Delta x/2 \in \{x : a < x < b\}$. If z(x) = x, then I becomes the well-known
“area under the curve” Riemann integral. But in some cases the Riemann integral won’t
suffice. For example, consider the expectation operation we encountered in Chapter 4, that
is, $E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx$, which will converge to the desired result if $f_X(x)$ is well-defined,
that is, a bounded function with only a finite set of discontinuities, etc. But if $f_X(x)$ does
not fall into this category of functions, we can still compute E[X] from the integral
$E[X] = \int_{-\infty}^{\infty} x\,dF_X(x)$, where $F_X(x)$ is the CDF of X. This type of integral is called a Stieltjes integral
and is a generalization of the “area under the curve” type integral that is taught in beginning
calculus courses. For example, if $F_X(x) = (1 - e^{-\lambda x})u(x)$, then $dF_X(x) = \lambda e^{-\lambda x}u(x)\,dx$
and $E[X] = \int_0^{\infty} x\lambda e^{-\lambda x}\,dx = 1/\lambda$.
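
The partition sum above is easy to try numerically. The sketch below is an illustration under stated assumptions: it indexes the partition by cell edges and evaluates f at cell midpoints (the same sum as in the text, with shifted indexing), the rate 2.0 is a hypothetical value, and the fair-die CDF is an added example of a distribution with no ordinary pdf.

```python
import math

def stieltjes_sum(f, z, a, b, n):
    """Approximate the integral of f(x) dz(x) over (a, b) with an n-cell partition.

    Each cell contributes f(midpoint) times the increment of z across the cell.
    """
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        left, right = a + i * dx, a + (i + 1) * dx
        total += f(0.5 * (left + right)) * (z(right) - z(left))
    return total

lam = 2.0  # hypothetical rate parameter

def exp_cdf(x):
    """CDF of the exponential law, F(x) = (1 - exp(-lam*x)) u(x)."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def die_cdf(x):
    """Staircase CDF of a fair six-sided die (jumps of 1/6 at 1, 2, ..., 6)."""
    return min(max(math.floor(x), 0), 6) / 6.0

mean = stieltjes_sum(lambda x: x, exp_cdf, 0.0, 40.0, 200_000)
die_mean = stieltjes_sum(lambda x: x, die_cdf, 0.0, 7.0, 7_000)
print(mean, 1 / lam)     # the first value should be close to 1/lam
print(die_mean, 3.5)     # the first value should be close to 3.5
```

The same routine handles both cases because it only ever uses increments of the CDF, never a density.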
Differentiation of Integrals


From time to time, it becomes necessary to differentiate an integral with respect to a
parameter which appears either in the upper limit, the lower limit, or the integrand itself:

$$\frac{d}{dy}\int_{a(y)}^{b(y)} f(x,y)\,dx = f(b(y),y)\frac{db(y)}{dy} - f(a(y),y)\frac{da(y)}{dy} + \int_{a(y)}^{b(y)} \frac{\partial f(x,y)}{\partial y}\,dx. \tag{A.2-1}$$

This important formula is derived by recalling that for a function I(b, a, y) where in turn,
b = b(y) and a = a(y) are two functions of y, we have

$$\frac{dI}{dy} = \frac{\partial I}{\partial b}\frac{db}{dy} + \frac{\partial I}{\partial a}\frac{da}{dy} + \frac{\partial I}{\partial y}.$$

If we denote

$$I \triangleq \int_{a(y)}^{b(y)} f(x,y)\,dx,$$

then clearly

$$\frac{\partial I}{\partial b} = f(b(y),y),$$

$$\frac{\partial I}{\partial a} = -f(a(y),y),$$

$$\frac{\partial I}{\partial y} = \frac{\partial}{\partial y}\int_{a(y)}^{b(y)} f(x,y)\,dx = \int_{a(y)}^{b(y)} \frac{\partial f(x,y)}{\partial y}\,dx.$$

The last step on the right follows from treating b(y) and a(y) as constants, since the variation
of I arising from its upper and lower limits is already counted by the first two terms.

An example of use of this formula, which arises in the study of how systems transform 
probability functions, is shown next. 


Example A.2-1
Consider the example where the function $f(x,y) = (x + 2y)^2$,

$$\frac{\partial}{\partial y}\int_0^y (x+2y)^2\,dx = (y+2y)^2\cdot 1 - (0+2y)^2\cdot 0 + \int_0^y 4(x+2y)\,dx$$

$$= (3y)^2 + 4\left(\frac{x^2}{2} + 2yx\right)\Big|_0^y$$

$$= (3y)^2 + 2y^2 + 8y^2 = 19y^2.$$
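
The result of this example can be cross-checked numerically. The sketch below, with hypothetical helper names and an arbitrarily chosen evaluation point y0 = 1.5, evaluates the right-hand side of Equation A.2-1 and compares it with a central finite difference of the integral itself, taking the integrand to be $(x + 2y)^2$ as in the worked steps.

```python
def integrate(g, lo, hi, n=2000):
    """Composite Simpson rule for the integral of g over [lo, hi] (n must be even)."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3

def leibniz_rhs(f, df_dy, a, da, b, db, y):
    """Right-hand side of Equation A.2-1 at the point y."""
    boundary = f(b(y), y) * db(y) - f(a(y), y) * da(y)
    interior = integrate(lambda x: df_dy(x, y), a(y), b(y))
    return boundary + interior

# Example A.2-1: f(x, y) = (x + 2y)^2 integrated over 0 <= x <= y.
f = lambda x, y: (x + 2 * y) ** 2
df_dy = lambda x, y: 4 * (x + 2 * y)
a, da = (lambda y: 0.0), (lambda y: 0.0)
b, db = (lambda y: y), (lambda y: 1.0)

y0, eps = 1.5, 1e-5
I = lambda y: integrate(lambda x: f(x, y), a(y), b(y))
finite_difference = (I(y0 + eps) - I(y0 - eps)) / (2 * eps)
# All three printed numbers should agree closely.
print(leibniz_rhs(f, df_dy, a, da, b, db, y0), finite_difference, 19 * y0**2)
```

Simpson's rule is exact for the low-order polynomials that appear here, so any remaining disagreement comes from the finite-difference step alone.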
Integration by Parts


Integration by parts is a useful technique for explicit calculation of integrals. We write the
formula as follows:

$$\int_a^b u(x)\,dv(x) = u(x)v(x)\Big|_a^b - \int_a^b v(x)\,du(x), \tag{A.2-2}$$

where u and v denote functions of the variable x with the integral extending over the range
$a \le x \le b$. This formula is derived using the product rule for derivatives, applied to the
derivative of the product function u(x)v(x). An example is shown below. Integration by
parts is useful to extend the class of integrals that are doable analytically.


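Example A.2-2
Consider the following integration problem:

$$\int_0^{\infty} x e^{-2x}\,dx.$$

Let u(x) = x and $dv(x) = e^{-2x}dx$, so that $v(x) = -\frac{1}{2}e^{-2x}$; then using the above
integration by parts formula we obtain

$$\int_0^{\infty} x e^{-2x}\,dx = x\left(-\frac{1}{2}e^{-2x}\right)\Big|_0^{\infty} - \int_0^{\infty}\left(-\frac{1}{2}e^{-2x}\right)dx = 0 + \frac{1}{2}\int_0^{\infty} e^{-2x}\,dx = \frac{1}{4}.$$

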
Completing the Square 


The method of completing the square is applied to the calculation of integrals by transforming
an unknown integral into a known one by turning the argument of its integrand
into a perfect square. For example, consider making a perfect square out of $x^2 + 4x$. We can
transform it into the perfect square $(x + 2)^2$ by adding and subtracting 4, that is,

$$x^2 + 4x = (x + 2)^2 - 4.$$

To see how this polynomial concept can be used to calculate integrals, consider the well-known
Gaussian integral that we often encounter in this course:

$$\int_{-\infty}^{+\infty} e^{-\frac{1}{2}x^2}\,dx = \sqrt{2\pi}.$$

If, instead, we need to calculate

$$\int_{-\infty}^{+\infty} e^{-\frac{1}{2}(x^2+4x)}\,dx = ?,$$

we can do so by completing the square as follows:

$$e^{2}\int_{-\infty}^{+\infty} e^{-\frac{1}{2}(x^2+4x+4)}\,dx,$$

where we have multiplied by $e^{-2}$ inside the integral and by $e^{2}$ outside. Then we continue,

$$= e^{2}\int_{-\infty}^{+\infty} e^{-\frac{1}{2}(x+2)^2}\,dx.$$

With the change of variables y = x + 2, this then becomes

$$= e^{2}\int_{-\infty}^{+\infty} e^{-\frac{1}{2}y^2}\,dy = e^{2}\sqrt{2\pi}.$$
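
A brute-force numerical integration offers a sanity check on this result. In the sketch below the integration window [-30, 30] and the number of cells are arbitrary choices, and `midpoint_integral` is a hypothetical helper, not a routine from the text.

```python
import math

def midpoint_integral(g, lo, hi, n):
    """Midpoint-rule estimate of the integral of g over [lo, hi] using n cells."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))

# Integrand exp(-(x^2 + 4x)/2); it is negligible outside the chosen window.
value = midpoint_integral(lambda x: math.exp(-0.5 * (x * x + 4 * x)), -30.0, 30.0, 60_000)
closed_form = math.exp(2) * math.sqrt(2 * math.pi)
print(value, closed_form)   # the two numbers should agree closely
```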


Double Integration 


Integrals on the (x, y) plane are properly called double integrals. The infinitesimal element
is an area, written as dx dy. We often evaluate these integrals in some order, say x first and
then y, or vice versa. Then the integral is called an iterated integral. We can write the three
possible situations as follows, for a rectangle $a \le x \le b$, $c \le y \le d$:

$$\int_c^d\left(\int_a^b f(x,y)\,dx\right)dy = \int_a^b\!\!\int_c^d f(x,y)\,dx\,dy = \int_a^b\left(\int_c^d f(x,y)\,dy\right)dx,$$

where the integral in the middle is the true double or area integral. Since limiting operations
are the basis for any integral, there is actually a question of whether the three
two-dimensional integrals are always equal. Fortunately, an advanced result in measure
theory [9-1] shows that when the integrals are defined in the modern Lebesgue sense, then
all three either exist and are equal, or do not exist. We will consider only the ordinarily
occurring case where the above three integrals exist and are equal.

Note that on the left, where we integrate on x first, the limits are interchanged
relative to the situation on the right, where we integrate in the y direction first. The double
or area integral in the middle adopts the notation that one reads the limits in x, y order,
just as in the function arguments and the area differential dx dy. Thus, there should be no
confusion in interpreting such expressions as


3 p5 
f f ze dz dy, 
1 Jo 


since we would read this correctly as an integral over the rectangle with opposite corners 
(x,y) = (1,0) and (x,y) = (3,5). 


Functions 


A function is a unique mapping from a domain space $\mathcal{X}$ to a range space $\mathcal{Y}$. The only
condition is uniqueness, which means that only one y goes with each x, that is, f(x) has one
and only one value. An example is $f(x) = x^2$. A counterexample is $f(x) = \pm\sqrt{x}$.
Monotone Functions. A monotone function of the real variable x is one that always
increases as x increases or always decreases as x increases. The former, with the positive 
slope, is called monotone increasing, while the latter, with the negative slope, is called 
monotone decreasing, as illustrated in Figures A.2-1 and A.2-2. If a function is monotone 
except for some flat regions of zero slope, then we use the terms monotone nondecreasing 
or monotone nonincreasing to describe them, as illustrated in Figure A.2-3. 


Inverse Functions. A function may or may not have an inverse. The inverse function exists
when the original function has the additional uniqueness property that to each y in $\mathcal{Y}$ there
corresponds only one x (in $\mathcal{X}$). This allows us to define an inverse function $f^{-1}(y)$ to map
back from $\mathcal{Y}$ to $\mathcal{X}$. We note that a sufficient condition for the inverse function to exist is
that the original function f(x) is monotone increasing or monotone decreasing. The function
sketched in Figure A.2-3 does not have an inverse due to the flat section of zero slope.


Figure A.2-1 Example of a monotone increasing function.


Figure A.2-2 Example of a monotone decreasing function.


Figure A.2-3 Example of a monotone nonincreasing function.


A.3 RESIDUE METHOD FOR INVERSE FOURIER TRANSFORMATION†


In Chapters 8 and 9, we defined the power spectral density (psd) S(w) for both discrete and 
continuous time and showed that the psd is central to analyzing LSI systems with random 
sequence and process inputs. We often want to take an inverse transform to find the corre- 
lation function corresponding to a given psd to obtain a time-domain characterization. This 
section summarizes the powerful residue method for accomplishing the necessary inverse 
Fourier transformation. 

We start by recalling the relation between the psd and correlation function for a WSS
random process,

$$S(\omega) = \int_{-\infty}^{+\infty} R(\tau)e^{-j\omega\tau}\,d\tau,$$

$$R(\tau) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} S(\omega)e^{+j\omega\tau}\,d\omega.$$

To apply the residue method of complex variable theory [A-3] to the evaluation of the above
IFT, we must first express this integral as an integral along a contour in the complex s-plane.
We define a new function S of the complex variable $s = \sigma + j\omega$ as follows.

First we define S(s) on the imaginary axis in terms of the function of a real variable
$S(\omega)$ as

$$S(s)|_{s=j\omega} \triangleq S(\omega).$$

Then we replace $j\omega$ by s to extend the function $S(j\omega)$ to the entire complex plane. Thus,

$$S(s)|_{s=j\omega} = S(\omega) = \int_{-\infty}^{+\infty} R(\tau)e^{-j\omega\tau}\,d\tau$$

so

$$S(s) = \int_{-\infty}^{+\infty} R(\tau)e^{-s\tau}\,d\tau, \tag{A.3-1}$$

which is the two-sided Laplace transform of the correlation function R. Also by inverse
Fourier transform,

$$R(\tau) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} S(s)|_{s=j\omega}\,e^{j\omega\tau}\,d\omega = \frac{1}{2\pi j}\int_{-j\infty}^{+j\infty} S(s)e^{s\tau}\,ds, \tag{A.3-2}$$

which is an integral along the imaginary axis of the s-plane.


†This material assumes that the reader is familiar with the discussions in Chapters 8 and 9.
The integral in Equation A.3-2 is called a contour integral in the theory of functions
of a complex variable [A-2] [A-3], where it is shown that one can evaluate such an integral 
over a closed contour by the method of residues. This method is particularly easy to apply 
when the functions are rational; that is, the function is the ratio of two polynomials in s. 
Since this situation often occurs in linear systems whose behavior is modeled by differential 
equations, this method of evaluation can be very useful. We state the main result as a fact 
from the theory of complex variables. 


Fact 


Let F(s) be a function of the complex variable s, which is analytic inside and on a closed
counterclockwise contour C except at P poles located inside C. The contour C encircles the
origin. The P poles are located at $s = p_i$, i = 1, ..., P. Then

$$\frac{1}{2\pi j}\oint_C F(s)\,ds = \sum_{p_i \text{ inside } C} \mathrm{Res}[F(s); s = p_i], \tag{A.3-3}$$

where

1. at a first-order pole, $\mathrm{Res}[F(s); s = p] = [F(s)(s - p)]|_{s=p}$;
2. at a second-order pole, $\mathrm{Res}[F(s); s = p] = \frac{d}{ds}[F(s)(s - p)^2]\big|_{s=p}$; and
3. at an nth-order pole, $\mathrm{Res}[F(s); s = p] = \frac{1}{(n-1)!}\left(\frac{d^{n-1}}{ds^{n-1}}[F(s)(s - p)^n]\right)\Big|_{s=p}$.


In applying these results to our problem we first have to close the contour in some fashion.
If we close the contour with a half-circle of infinite radius $C_L$ as shown in Figure A.3-1, then
provided that the function being integrated, $S(s)e^{s\tau}$, tends to zero fast enough as $|s| \to +\infty$,
the value of the integral will not be changed by this closing of the contour. In other words,
the integral over the semicircular part of the contour will be zero. The conditions for this
are that $|S(s)|$ stays bounded as $|s| \to +\infty$, and

$$|e^{s\tau}| \to 0 \quad\text{as } \mathrm{Re}(s) \to -\infty,$$

the latter of which is satisfied for all $\tau > 0$. Thus, for positive $\tau$ we have

$$R(\tau) = \frac{1}{2\pi j}\oint_{C_L} S(s)e^{s\tau}\,ds = \sum_{p_i \text{ inside } C_L} \mathrm{Res}[S(s)e^{s\tau}; s = p_i].$$

Figure A.3-1 Closed contour in left-half of s-plane for $\tau > 0$.

Similarly, for $\tau < 0$ one can show that it is permissible to close the contour to the right as
shown in Figure A.3-2, in which case we have

$$|e^{s\tau}| \to 0 \quad\text{as } \mathrm{Re}(s) \to +\infty,$$

so that we get

$$R(\tau) = \frac{1}{2\pi j}\oint_{C_R} S(s)e^{s\tau}\,ds = -\sum_{p_i \text{ inside } C_R} \mathrm{Res}[S(s)e^{s\tau}; s = p_i]$$

for $\tau < 0$, the minus sign arising from the clockwise traversal of the contour.


Figure A.3-2 Closed contour in right-half of s-plane for $\tau < 0$.


Example A.3-1
(first-order psd) Let

$$S(\omega) = 2a/(a^2 + \omega^2), \quad 0 < a < 1.$$

Then

$$S(s)|_{s=j\omega} = S(\omega) = \frac{2a}{a^2 + \omega^2} = \frac{2a}{(j\omega + a)(-j\omega + a)},$$

so

$$S(s) = \frac{2a}{(s + a)(-s + a)},$$

where the configuration of the poles in the s-plane is shown in Figure A.3-3.

Figure A.3-3 Pole-zero diagram.

Evaluating the residues for $\tau > 0$, we get

$$R(\tau) = \mathrm{Res}[S(s)e^{s\tau}; s = -a] = \frac{2ae^{s\tau}}{(-s + a)}\Big|_{s=-a} = \frac{2a}{2a}e^{-a\tau} = e^{-a\tau},$$

while for $\tau < 0$ we get

$$R(\tau) = -\mathrm{Res}[S(s)e^{s\tau}; s = +a] = -\frac{2ae^{s\tau}(s - a)}{(s + a)(-s + a)}\Big|_{s=a} = -\frac{2ae^{s\tau}}{(s + a)(-1)}\Big|_{s=a} = \frac{2a}{2a}e^{a\tau} = e^{a\tau}.$$

Combining the results into a single formula, we get

$$R(\tau) = \exp(-a|\tau|), \quad -\infty < \tau < +\infty.$$
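
The residue result can be compared with a direct numerical evaluation of the inverse Fourier integral. The sketch below is illustrative only: the value a = 0.5, the truncation frequency, and the function names are assumptions, and because the psd decays only like $1/\omega^2$ the truncated integral agrees with the closed form to about three decimal places rather than to machine precision.

```python
import math

def psd(omega, a):
    """First-order psd S(omega) = 2a / (a^2 + omega^2)."""
    return 2 * a / (a * a + omega * omega)

def inverse_transform(tau, a, w_max=2000.0, n=200_000):
    """(1/2pi) * integral of S(w) exp(j*w*tau) dw.

    S is even, so the integral reduces to (1/pi) times the integral of
    S(w) cos(w*tau) over [0, infinity), truncated here at w_max (midpoint rule).
    """
    dw = w_max / n
    total = 0.0
    for i in range(n):
        w = (i + 0.5) * dw
        total += psd(w, a) * math.cos(w * tau)
    return total * dw / math.pi

a = 0.5   # hypothetical parameter value in (0, 1)
for tau in (-2.0, 0.0, 1.5):
    print(tau, inverse_transform(tau, a), math.exp(-a * abs(tau)))
```

The comparison for negative and positive lags exercises both halves of the combined formula.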





Inverse Fourier Transform for psd of Random Sequence


In the case of a random sequence one can do a similar contour integral evaluation in the
complex z-plane. We recall the transform and inverse transform for a sequence:

$$S(\omega) = \sum_{m=-\infty}^{+\infty} R[m]e^{-j\omega m},$$

$$R[m] = \frac{1}{2\pi}\int_{-\pi}^{+\pi} S(\omega)e^{+j\omega m}\,d\omega.$$

We rewrite the latter integral as a contour integral around the unit circle in a complex plane
by defining the function of a complex variable, $S(z)|_{z=e^{j\omega}} \triangleq S(\omega)$, and then substituting z
for $e^{j\omega}$ into this new function to obtain the psd as a z-transform,

$$S(z) = \sum_{m=-\infty}^{+\infty} R[m]z^{-m} \quad\text{and}\quad R[m] = \frac{1}{2\pi j}\oint_C S(z)z^{m-1}\,dz \quad\text{where } C = \{|z| = 1\}. \tag{A.3-4}$$

In this case the contour is already closed and it encircles the origin in a counterclockwise
direction, so we can apply Equation A.3-3 directly to obtain

$$R[m] = \sum_{p_i \text{ inside } C} \mathrm{Res}[S(z)z^{m-1}; z = p_i],$$

where the sum is over the residues at the poles inside the unit circle. This formula is valid
for all values of the integer m; however, it is awkward to evaluate for negative m due to the
variable-order pole contributed by $z^{m-1}$ at z = 0. Fortunately, a transformation mapping
z to 1/z conveniently solves this problem, and we have [A-1],

$$R[m] = \frac{1}{2\pi j}\oint_C S(z^{-1})z^{-m+1}\,(z^{-2}dz) = \frac{1}{2\pi j}\oint_C S(z^{-1})z^{-m-1}\,dz,$$

avoiding the variable-order pole for m < 0. We thus arrive at the prescription:
For $m \ge 0$

$$R[m] = \sum_{i:\ \text{poles inside unit circle}} \mathrm{Res}[S(z)z^{m-1}; z = p_i],$$

and for m < 0

$$R[m] = \sum_{i:\ \text{poles outside unit circle}} \mathrm{Res}[S(z^{-1})z^{-m-1}; z = p_i^{-1}].$$

Example A.3-2
(first-order psd of random sequence) We consider a psd given as

$$S(\omega) = \frac{2(1 - \rho^2)}{(1 + \rho^2) - 2\rho\cos\omega}, \quad |\omega| \le \pi, \tag{A.3-5}$$

which is plotted in Figure A.3-4.

Figure A.3-4 Plot of psd $S(\omega)$ for a $\rho$ value in (0,1).

Using the identity $\cos\omega = 0.5(\exp j\omega + \exp -j\omega)$, we can make this substitution in
Equation A.3-5 to obtain the function of a complex variable,

$$S(z)|_{z=e^{j\omega}} = S(\omega) = \frac{2(1 - \rho^2)}{(1 + \rho^2) - 2\rho\cos\omega} = \frac{2(1 - \rho^2)}{(1 + \rho^2) - \rho(e^{j\omega} + e^{-j\omega})}.$$

Then we replace $e^{j\omega}$ by z to obtain the function of z,

$$S(z) = \frac{2(1 - \rho^2)}{(1 + \rho^2) - \rho(z + z^{-1})} = \frac{-2(\rho^{-1} - \rho)\,z}{(z - \rho)(z - \rho^{-1})}.$$

The z-plane pole-zero configuration of this function is shown in Figure A.3-5. The overall
transformation from $S(\omega)$ to S(z) is thus given by the replacement

$$\cos\omega \leftarrow \tfrac{1}{2}(z + z^{-1}). \tag{A.3-6}$$

Figure A.3-5 z-plane (poles at $z = \rho$ and $z = 1/\rho$ on the Re(z) axis).

For $m \ge 0$ we get

$$R[m] = \mathrm{Res}[S(z)z^{m-1}; z = \rho] = \frac{-2(\rho^{-1} - \rho)\,z^{m}}{(z - \rho^{-1})}\Big|_{z=\rho} = \frac{-2(\rho^{-1} - \rho)\,\rho^{m}}{(\rho - \rho^{-1})} = 2\rho^{m}.$$

For m < 0 we have

$$R[m] = \mathrm{Res}[S(z^{-1})z^{-m-1}; z = \rho],$$

since $z = \rho^{-1}$ is the one pole outside the unit circle.
Now

$$S(1/z) = \frac{-2(\rho^{-1} - \rho)\,z^{-1}}{(z^{-1} - \rho)(z^{-1} - \rho^{-1})} = \frac{-2(\rho^{-1} - \rho)\,z}{(z - \rho)(z - \rho^{-1})} = S(z),$$

which could easily have been foretold from the symmetry evident in Equation A.3-6. Then

$$\mathrm{Res}[S(z^{-1})z^{-m-1}; z = \rho] = \frac{-2(\rho^{-1} - \rho)\,z^{-m}}{(z - \rho^{-1})}\Big|_{z=\rho} = \frac{-2(\rho^{-1} - \rho)\,\rho^{-m}}{(\rho - \rho^{-1})} = 2\rho^{-m}.$$

Combining, we get the overall answer

$$R[m] = 2\rho^{|m|}, \quad -\infty < m < +\infty.$$
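
A direct numerical inversion gives an independent check of this answer. The sketch below assumes that the psd numerator carries the factor 2 that appears in the example's working, so that the expected correlation is $2\rho^{|m|}$; the value of ρ, the grid size, and the function names are illustrative choices.

```python
import math

def psd(omega, rho):
    """S(omega) = 2(1 - rho^2) / ((1 + rho^2) - 2 rho cos(omega)), assumed form of A.3-5."""
    return 2 * (1 - rho**2) / ((1 + rho**2) - 2 * rho * math.cos(omega))

def correlation(m, rho, n=4096):
    """R[m] = (1/2pi) * integral over [-pi, pi) of S(w) exp(j*w*m) dw.

    S is even, so only the cosine part contributes; a uniform-grid sum is
    very accurate here because the integrand is smooth and periodic.
    """
    dw = 2 * math.pi / n
    total = 0.0
    for k in range(n):
        w = -math.pi + k * dw
        total += psd(w, rho) * math.cos(w * m)
    return total / n

rho = 0.6   # hypothetical value in (0, 1)
for m in (-5, 0, 3):
    print(m, correlation(m, rho), 2 * rho ** abs(m))
```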
A.4 MATHEMATICAL INDUCTION


Many proofs in probability are obtained by mathematical induction. Mathematical induction
is a method for obtaining results, especially proving theorems, which are difficult if not
impossible to get by any other method. For example: It is claimed that the set S contains
all the positive integers. How would we verify this? We could show that $1 \in S$, $2 \in S$, $3 \in S$,
etc. But using this procedure would not allow us to finish in finite time. Instead we can use
the general principle of mathematical induction:

Let $\{C_k\}$ be an infinite sequence of propositions, given for all $k \ge 1$. We wish to prove
that these propositions are true for every $k \ge 1$. Instead of proving them one by one, we
rely on the following principle.


(i) If $C_1$ is true,
(ii) and for arbitrary $k \ge 1$, “$C_k$ is true” implies “$C_{k+1}$ is true,”


then $C_k$ holds for all $k \ge 1$.


Thus, we only have to perform the two steps (i and ii), using mathematical induction.
After identifying the indexed set of propositions $\{C_k\}$ for our particular problem, we first
show that $C_1$ is true. Then we try to show the second step is true. We do this by assuming
that $C_k$ is true for an arbitrary value of positive index k, and then attempting to show that
this fact implies that proposition $C_{k+1}$ is true. Then we are finished.


Example A.4-1
(mathematical induction) Show that if 0 < a < b then $a^n < b^n$ for all positive integers n.


Solution We choose the method of induction. The problem statement that 0 < a < b
gives us directly the proposition $C_1 = \{a < b\}$. We then let $C_k$ be the proposition
that $a^k < b^k$, that is, $C_k \triangleq \{a^k < b^k\}$. Now assume that $C_k$ is true, meaning $a^k < b^k$ for
some k. It then follows that $a^{k+1} = a \times a^k < a \times b^k < b \times b^k = b^{k+1}$. Thus, $C_{k+1}$ is true. The
principle of mathematical induction then allows us to conclude that all the propositions $C_k$
are true, that is, $a^k < b^k$, for all positive integers k.
APPENDIX B Gamma and Delta
Functions


B.1 GAMMA FUNCTION 


The Gamma function $\Gamma(\alpha)$, for real $\alpha$, is defined by the integral [1,2]

$$\Gamma(\alpha) \triangleq \int_0^{\infty} t^{\alpha-1}e^{-t}\,dt, \tag{B.1-1}$$

where $\alpha > 0$. From Equation B.1-1 we see that $\Gamma(1) = 1$. If we integrate $\Gamma(\alpha + 1)$ by parts
we obtain

$$\Gamma(\alpha + 1) = \int_0^{\infty} t^{\alpha}e^{-t}\,dt = -t^{\alpha}e^{-t}\Big|_0^{\infty} + \alpha\int_0^{\infty} t^{\alpha-1}e^{-t}\,dt = \alpha\Gamma(\alpha).$$

Hence, for positive integer k,

$$\Gamma(k) = (k - 1)!$$

For values of the argument between the integers, the gamma function does a smooth
interpolation. It is available in MATLAB as the function gamma.

Note that $\Gamma(1) = 1$ corresponds to 0! = 1. We leave it to the reader to show that
$\Gamma(0.5) = \sqrt{\pi}$ and $\Gamma(1.5) = \sqrt{\pi}/2$. The Gamma function is sometimes called the generalized factorial
function.
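
These properties are easy to confirm numerically. The sketch below uses Python's standard-library `math.gamma` in place of the MATLAB function mentioned above; `recursion_gap` is a hypothetical helper introduced only for this check.

```python
import math

def recursion_gap(alpha):
    """Gamma(alpha + 1) - alpha * Gamma(alpha); should be zero up to rounding."""
    return math.gamma(alpha + 1) - alpha * math.gamma(alpha)

# Gamma(k) reproduces (k - 1)! at the positive integers.
for k in range(1, 6):
    print(k, math.gamma(k), math.factorial(k - 1))

# Half-integer values quoted in the text.
print(math.gamma(0.5), math.sqrt(math.pi))
print(math.gamma(1.5), math.sqrt(math.pi) / 2)

# The recursion Gamma(alpha + 1) = alpha * Gamma(alpha) at a non-integer argument.
print(recursion_gap(2.7))
```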


B.2 INCOMPLETE GAMMA FUNCTION


The (upper) incomplete Gamma function $\Gamma(\alpha, x)$ is defined by the integral

$$\Gamma(\alpha, x) \triangleq \int_x^{\infty} t^{\alpha-1}e^{-t}\,dt,$$

where $\alpha > 0$. The (lower) incomplete Gamma function is defined by

$$\gamma(\alpha, x) \triangleq \int_0^{x} t^{\alpha-1}e^{-t}\,dt.$$

Unless stated otherwise, incomplete Gamma function will mean the upper incomplete Gamma
function. Clearly $\Gamma(\alpha, 0) = \Gamma(\alpha)$. For $\alpha = k$ an integer, the incomplete Gamma function is
known to satisfy the series [3, 4]

$$\Gamma(k, x) = (k - 1)!\,e^{-x}\sum_{l=0}^{k-1}\frac{x^l}{l!},$$

which can also be written as

$$\Gamma(k, x) = (k - 1)\Gamma(k - 1, x) + x^{k-1}e^{-x},$$

and it is available in MATLAB as the function gammainc. This function plays a crucial role
in evaluating the distribution function of the Poisson random variable.
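
The link to the Poisson distribution function can be made concrete with a small sketch. The function names and the test values k = 4, x = 2.5 are assumptions made for illustration; the series, the recursion, and the relation $P[N \le k-1] = \Gamma(k, \lambda)/(k-1)!$ are compared against each other and against a direct numerical integration of the defining integral.

```python
import math

def upper_incomplete_gamma_int(k, x):
    """Gamma(k, x) for a positive integer k via (k-1)! e^{-x} sum_{l<k} x^l / l!."""
    series = sum(x**l / math.factorial(l) for l in range(k))
    return math.factorial(k - 1) * math.exp(-x) * series

def upper_incomplete_gamma_numeric(k, x, span=60.0, n=200_000):
    """Midpoint-rule estimate of the defining integral, truncated to [x, x + span]."""
    h = span / n
    total = 0.0
    for i in range(n):
        t = x + (i + 0.5) * h
        total += t ** (k - 1) * math.exp(-t)
    return total * h

def poisson_cdf(k, lam):
    """P[N <= k] for a Poisson random variable with mean lam."""
    return sum(math.exp(-lam) * lam**l / math.factorial(l) for l in range(k + 1))

k, x = 4, 2.5   # hypothetical test values
print(upper_incomplete_gamma_int(k, x), upper_incomplete_gamma_numeric(k, x))
print(upper_incomplete_gamma_int(k, x),
      (k - 1) * upper_incomplete_gamma_int(k - 1, x) + x ** (k - 1) * math.exp(-x))
print(poisson_cdf(k - 1, x), upper_incomplete_gamma_int(k, x) / math.factorial(k - 1))
```

Each printed pair should agree closely; the last pair shows the Poisson distribution function expressed through the incomplete Gamma function.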


B.3 DIRAC DELTA FUNCTION 


The Dirac delta function $\delta(x)$ is often defined as a “function” that is zero everywhere except
at x = 0, where it is infinite such that

$$\int_{-\infty}^{\infty} \delta(x)\,dx = 1.$$

The mild controversy about regarding $\delta(x)$ as a “function” in the ordinary sense is partly due
to it not being of bounded variation and not having bounded energy in any finite-length
support that contains it. Another definition is to regard $\delta(x)$ as the limit of one of several
pulses. For example, with rectangular window,

$$w(x) \triangleq \begin{cases} 1/b, & -b/2 < x < b/2, \\ 0, & \text{else}, \end{cases}$$

we can define $\delta(x)$ as

$$\delta(x) \triangleq \lim_{a \to \infty}\{a\,w(ax)\}.$$

Figure B.3-1 Rectangular and Gaussian-shaped pulses of unit area.

Another possibility is to define $\delta(x)$ as

$$\delta(x) \triangleq \lim_{a \to \infty}\{a\exp(-\pi a^2x^2)\}.$$

The rectangular and Gaussian shaped pulses are shown in Figure B.3-1. The function
$a\,w(ax)$ has discontinuous derivatives, whereas $a\exp(-\pi a^2x^2)$ has continuous derivatives.
The exact shape of these functions is immaterial. Their important features are (1) unit area
and (2) rapid decrease to zero for $x \ne 0$.

Still another definition is to call any object a delta function if for any function $f(\cdot)$
continuous at x it satisfies the integral equation†

$$\int_{-\infty}^{\infty} f(y)\delta(y - x)\,dy = f(x). \tag{B.3-1}$$

This definition can, of course, be related to the previous one, since either of the pulses
when substituted for $\delta(x)$ in Equation (B.3-1) will essentially furnish the same result when
a is large. This follows because the integrand is significantly nonzero only for $x \approx y$. The
integral can, therefore, be approximately evaluated by replacing f(y) by f(x) and moving
it outside the integral. Then, since both pulses have unit area, the result follows. Note that
$\delta(x) = \delta(-x)$.
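
The sifting behavior of the two pulse families can be observed numerically. In the sketch below the rectangular pulse is taken with unit width (an assumption about the window parameter), and the test function cos, the point x = 0.3, the scale values, and the helper `sift` are all illustrative choices.

```python
import math

def sift(f, pulse, x, a, half_width=1.0, n=200_000):
    """Midpoint estimate of the integral of f(y) * a * pulse(a * (y - x)) dy near y = x."""
    lo, hi = x - half_width, x + half_width
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        total += f(y) * a * pulse(a * (y - x))
    return total * h

rect = lambda u: 1.0 if abs(u) < 0.5 else 0.0    # unit-area rectangular pulse (unit width assumed)
gauss = lambda u: math.exp(-math.pi * u * u)     # unit-area Gaussian pulse

x = 0.3
for a in (5.0, 50.0, 500.0):
    # Both columns should approach cos(0.3) as a grows.
    print(a, sift(math.cos, rect, x, a), sift(math.cos, gauss, x, a), math.cos(x))
```

As the scale a grows, both pulses concentrate around y = x and the integral settles on f(x), independent of the pulse shape.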

Consider now the unit step $u(x - x_i)$, which is discontinuous at $x = x_i$ with $u(0) \triangleq 1$
(Figure B.3-2a). The discontinuity can be viewed as the limit of the function shown in
Figure B.3-2b. The derivative is shown in Figure B.3-2c.

The derivative of the function shown in Figure B.3-2b is given by 








dF| a dF(xi) _ lim y ZX; 
dz z: E dz; ~ Az—0 Az; Ag (B.3-2) 


†A word of caution is in order here. Since $\delta(x)$ is zero everywhere except at a single point, its integral
(in the Riemann sense) is not defined. Hence, Equation B.3-1 is essentially symbolic, that is, it implies a
limiting operation as was done with the rectangular and Gaussian pulses.
Figure B.3-2 (a) unit step $u(x - x_i)$; (b) approximation to unit step; (c) derivative of function in (b).


Thus, formally, the derivative at a step discontinuity is a delta function with weightt 
proportional to the height of the jump. It is not uncommon to call (x — z;) the delta 
function at “x,;.” 

Returning now to Equation 2.5-7 in Chapter 2, which can be written as

$$F_X(x) = \sum_i P_X(x_i)\,u(x - x_i),$$

and using the result of Equation B.3-2 enables us to write for a discrete RV:

$$f_X(x) = \frac{dF_X(x)}{dx} = \sum_i P_X(x_i)\,\delta(x - x_i), \tag{B.3-3}$$

where we recall that $P_X(x_i) \triangleq F_X(x_i) - F_X(x_i^-)$ and the unit step assures that the summation
is over all $i$ such that $x_i \le x$.


†It is also called the area of the delta function.
Appendix C: Functional Transformations and Jacobians


C.1 INTRODUCTION 


Functional transformations play an important role in probability theory as well as many 
other fields. In this appendix, we shall review the theory of Jacobians, beginning with a 
two-function-to-two-function transformation and extending the result to the n-function-to- 
n-function case. First, we should recall two basic results from advanced calculus: 


Theorem C.1-1 Consider a bounded linear transformation $L$ from $E^n$ to $E^n$. If $D$ is
a bounded set in $E^n$ with $n$-dimensional volume $V(D)$, then the volume of $L(D)$ is merely
$k \times V(D)$, where $k$ is a constant independent of $D$. ∎


Theorem C.1-2 If $T$ is a transformation of class $C^1$ from $E^n$ to $E^n$ in an open set $D$
then, at every point $p \in D$, $dT$ is a linear transformation from $E^n$ to $E^n$. ∎


The first theorem states that the effect of $L$ is merely to multiply the volume by a
constant that doesn’t depend on the shape of $D$. The second theorem states that, at the
differential level, even nonlinear transformations become linear, provided that the
transformations consist of differentiable functions. Both theorems will find application in this
development.
Figure C.2-1 An infinitesimal rectangle is mapped into an infinitesimal parallelogram.


C.2 JACOBIANS FOR n = 2 


Consider the pair of one-to-one† differentiable functions $v = g(x,y)$, $w = h(x,y)$ with
the unique inverse $x = \phi(v,w)$, $y = \psi(v,w)$. As the vector $\mathbf{z} = (v,w)$ traces out the
infinitesimal rectangle $\mathcal{R}$ in the $v$-$w$ plane, the vector $\mathbf{u} = (x,y)$ traces out the infinitesimal
parallelogram $\mathcal{S}$ in the $x$-$y$ plane. By Theorem C.1-2, this differential transformation is
linear, and by Theorem C.1-1, the ratio of the areas, $A(\mathcal{S})/A(\mathcal{R})$, is a constant. We shall
denote this constant by $|J|$ and compute its value.

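We can compute the constant $J$ with the aid of Figure C.2-1. Recalling that $x = \phi(v,w)$,
$y = \psi(v,w)$, we compute the image points $P_1', P_2', P_3'$ of the vertices at $P_1, P_2, P_3$ as:

$$P_1' = (x, y), \quad P_2' = \left(x + \frac{\partial\phi}{\partial v}\,dv,\; y + \frac{\partial\psi}{\partial v}\,dv\right), \quad P_3' = \left(x + \frac{\partial\phi}{\partial w}\,dw,\; y + \frac{\partial\psi}{\partial w}\,dw\right).$$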


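These results are directly obtained by a Taylor series expansion about $(x, y)$. Thus, for
example, the coordinates $(x_2, y_2)$ of $P_2'$ are obtained from

$$x_2 = \phi(v + dv, w) = \phi(v, w) + \frac{\partial\phi}{\partial v}\,dv \quad\text{and}\quad y_2 = \psi(v + dv, w) = \psi(v, w) + \frac{\partial\psi}{\partial v}\,dv.$$

No derivatives with respect to $w$ appear because $w$ is held constant in going
from $P_1$ to $P_2$. A result from vector analysis is that the area of a parallelogram spanned
by the vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ is given by the magnitude of the cross product, that is,

$$A(\mathcal{S}) = |\mathbf{v}_1 \times \mathbf{v}_2| = \left|\left(\frac{\partial\phi}{\partial v}\,\mathbf{i}\,dv + \frac{\partial\psi}{\partial v}\,\mathbf{j}\,dv\right) \times \left(\frac{\partial\phi}{\partial w}\,\mathbf{i}\,dw + \frac{\partial\psi}{\partial w}\,\mathbf{j}\,dw\right)\right|,$$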





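where we used the fact that $\mathbf{v}_1 = P_2' - P_1'$ and $\mathbf{v}_2 = P_3' - P_1'$. The unit vectors $\mathbf{i}, \mathbf{j}$ satisfy
$\mathbf{i} \times \mathbf{j} = \mathbf{k}$, $\mathbf{j} \times \mathbf{i} = -\mathbf{k}$, $\mathbf{i} \times \mathbf{i} = \mathbf{j} \times \mathbf{j} = \mathbf{0}$, where $\mathbf{k} \perp \mathbf{i}, \mathbf{j}$ and points out of the plane of the
paper. Thus,

$$A(\mathcal{S}) = \left|\frac{\partial\phi}{\partial v}\frac{\partial\psi}{\partial w} - \frac{\partial\phi}{\partial w}\frac{\partial\psi}{\partial v}\right| dv\,dw.$$

Since $A(\mathcal{R}) = dv\,dw$, we find that the ratio of the areas is

$$A(\mathcal{S})/A(\mathcal{R}) = \left|\frac{\partial\phi}{\partial v}\frac{\partial\psi}{\partial w} - \frac{\partial\phi}{\partial w}\frac{\partial\psi}{\partial v}\right| = |J|.$$

†This means that every point $(x,y)$ maps into a unique $(v,w)$ and vice versa.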

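In higher dimensions it is easier to write $J$ as a determinant. Indeed, even in this
two-dimensional case, we can write:

$$J = \begin{vmatrix} \dfrac{\partial\phi}{\partial v} & \dfrac{\partial\phi}{\partial w} \\[2mm] \dfrac{\partial\psi}{\partial v} & \dfrac{\partial\psi}{\partial w} \end{vmatrix} = \frac{\partial\phi}{\partial v}\frac{\partial\psi}{\partial w} - \frac{\partial\phi}{\partial w}\frac{\partial\psi}{\partial v}.$$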


The quantity $J$ is called the Jacobian of the transformation $x = \phi(v,w)$, $y = \psi(v,w)$.
Among other things, the Jacobian is necessary to preserve probability measure (sometimes
called the probability mass or probability volume). For example, consider a pdf
$f_{XY}(x,y)$ and the transformation $x = \phi(v,w)$, $y = \psi(v,w)$. Consider the event $B \triangleq \{\zeta:
(X,Y) \in \wp \subset E^2\}$. Then

$$P(B) = \iint_{\wp} f_{XY}(x,y)\,dx\,dy \neq \iint_{\wp'} f_{XY}(\phi(v,w), \psi(v,w))\,dv\,dw$$

because the volume $dx\,dy \neq dv\,dw$; here $\wp'$ denotes the region of the $v$-$w$ plane that maps onto $\wp$.
What is needed is the Jacobian to create the equality among the integrals as

$$\iint_{\wp} f_{XY}(x,y)\,dx\,dy = \iint_{\wp'} f_{XY}(\phi(v,w), \psi(v,w))\,|J|\,dv\,dw.$$


Sometimes it may be easier to deal with the original functions $v = g(x,y)$, $w = h(x,y)$
than the inverse functions $x = \phi(v,w)$, $y = \psi(v,w)$. To get the desired result, we recompute
the ratio of areas by considering the image, $\mathcal{R}'$, in the $v$-$w$ system, of an infinitesimal
rectangle, $\mathcal{S}'$, in the $x$-$y$ system (Figure C.2-2). Following the same procedure as before, we
obtain $A(\mathcal{S}')/A(\mathcal{R}') \triangleq 1/|\tilde{J}|$, where the primes help indicate the regions in the two systems
and $\tilde{J}$ is given by

$$\tilde{J} = \begin{vmatrix} \dfrac{\partial g}{\partial x} & \dfrac{\partial g}{\partial y} \\[2mm] \dfrac{\partial h}{\partial x} & \dfrac{\partial h}{\partial y} \end{vmatrix}.$$

But, by Theorem C.1-1, $A(\mathcal{S}')/A(\mathcal{R}') = A(\mathcal{S})/A(\mathcal{R})$ and, hence, $|J| = 1/|\tilde{J}|$ or $|J\tilde{J}| = 1$.
We leave the details of the computation as an exercise for the reader.
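The reciprocal relation between the Jacobian of the inverse functions and the Jacobian of the original functions can be checked numerically. The sketch below is only an illustration under stated assumptions: the Cartesian-to-polar pair is our own choice of $g$, $h$ (it is not taken from the text), and the helper name `jacobian_det` and the finite-difference step are hypothetical.

```python
import math

def jacobian_det(f1, f2, a, b, eps=1e-6):
    """2x2 Jacobian determinant of (f1, f2) at (a, b), by central differences."""
    d11 = (f1(a + eps, b) - f1(a - eps, b)) / (2 * eps)
    d12 = (f1(a, b + eps) - f1(a, b - eps)) / (2 * eps)
    d21 = (f2(a + eps, b) - f2(a - eps, b)) / (2 * eps)
    d22 = (f2(a, b + eps) - f2(a, b - eps)) / (2 * eps)
    return d11 * d22 - d12 * d21

# Original functions v = g(x, y), w = h(x, y): Cartesian to polar (assumed example).
def g(x, y):
    return math.hypot(x, y)

def h(x, y):
    return math.atan2(y, x)

# Inverse functions x = phi(v, w), y = psi(v, w).
def phi(v, w):
    return v * math.cos(w)

def psi(v, w):
    return v * math.sin(w)

x0, y0 = 1.2, 0.7
v0, w0 = g(x0, y0), h(x0, y0)

J = jacobian_det(phi, psi, v0, w0)     # Jacobian of the inverse functions (equals v0 here)
J_tilde = jacobian_det(g, h, x0, y0)   # Jacobian of the original functions (equals 1/v0 here)
print(J, J_tilde, J * J_tilde)         # the product should be close to 1
```

For this particular pair the Jacobian of the inverse functions is the radius $v$, so the familiar area element $v\,dv\,dw$ of polar coordinates is recovered.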
Figure C.2-2 An infinitesimal parallelogram is mapped into an infinitesimal rectangle.


C.3 JACOBIAN FOR GENERAL n 


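The general case is easier to deal with if we allow ourselves to use matrix and vector notation
and some results from linear algebra. First, it is not convenient to use the unit vectors $\mathbf{i}, \mathbf{j}, \mathbf{k}$
in higher dimensions. Instead, we use unit vectors that are represented by column vectors.
Thus, in $E^2$ we use $\mathbf{e}_1 = [1,0]^T$ and $\mathbf{e}_2 = [0,1]^T$. Then

$$\mathbf{v}_1 = \frac{\partial\phi}{\partial v}\,dv\,\mathbf{e}_1 + \frac{\partial\psi}{\partial v}\,dv\,\mathbf{e}_2 = \left[\frac{\partial\phi}{\partial v}\,dv,\; \frac{\partial\psi}{\partial v}\,dv\right]^T$$

and

$$\mathbf{v}_2 = \frac{\partial\phi}{\partial w}\,dw\,\mathbf{e}_1 + \frac{\partial\psi}{\partial w}\,dw\,\mathbf{e}_2 = \left[\frac{\partial\phi}{\partial w}\,dw,\; \frac{\partial\psi}{\partial w}\,dw\right]^T.$$

Next, we form the $2 \times 2$ matrix $\mathbf{V}_2 = [\mathbf{v}_1\ \mathbf{v}_2]$, where the subscript 2 on $\mathbf{V}_2$ refers to
two-dimensional Euclidean space.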

Then, for the special case of $n = 2$, $A(\mathcal{S})$ is given by $|\det \mathbf{V}_2|$. As we go to higher
dimensions we drop the term “area of the parallelepiped” in favor of “volume of the
parallelepiped,” although purists would argue that for spaces of dimensions higher than three we
should use “hypervolume.” Also in higher dimensions, it is easier to use different subscripts
rather than different symbols for functions and arguments. In $n$-dimensional space, the
volume of a parallelepiped is always given by the height times the base area, where the base
area is the volume of the parallelepiped in $(n-1)$-dimensional space and the height is the
length of the component of $\mathbf{v}_n$ that is orthogonal to the vectors that span $E^{n-1}$. Thus,
in $E^2$ the base area is the length of the chosen base vector and the height is the length
of the orthogonal component of the second vector. In $E^3$, the base area is the area of
the parallelogram spanned by any two of the three vectors and the height is the length
of the component of the third vector orthogonal to the plane containing the first two
vectors.
We wish to compute the volume of an infinitesimal parallelepiped in $n$-dimensional
space. Motivated by the fact that the volume, $V_2$, in two-dimensional space is given by
$V_2 = |\det \mathbf{V}_2|$, we are tempted to write that $V_n = |\det \mathbf{V}_n|$. Is this true? The answer is yes
and the proof is furnished by induction. Thus, we assume that $V_n = |\det \mathbf{V}_n|$ is true and
we must prove that $V_{n+1} = |\det \mathbf{V}_{n+1}|$. Now in terms of the vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n, \mathbf{v}_{n+1}$
the matrix $\mathbf{V}_{n+1}$ can be written as

$$\mathbf{V}_{n+1} = \left[\begin{array}{ccc|c} & & & v_{n+1,1} \\ & \mathbf{V}_n & & \vdots \\ & & & v_{n+1,n} \\ \hline 0 & \cdots & 0 & v_{n+1,n+1} \end{array}\right].$$

To compute $|\det \mathbf{V}_{n+1}|$ we expand by the bottom row to obtain $|\det \mathbf{V}_{n+1}| =
|v_{n+1,n+1}|\,|\det \mathbf{V}_n|$, since all other terms in the expansion are zero. Now consider the vector
$\mathbf{v}_{n+1}$ in more detail. In terms of the unit vectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{n+1}$, it can be written as

$$\mathbf{v}_{n+1} = v_{n+1,n+1}\,\mathbf{e}_{n+1} + \sum_{i=1}^{n} v_{n+1,i}\,\mathbf{e}_i,$$

where $\mathbf{e}_i$ has a 1 in the $i$th position (row) and 0’s in the remaining $n$ positions. But $\mathbf{e}_{n+1}$
is the unit vector orthogonal to $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n$, and hence is orthogonal to the space
spanned by them, and $|v_{n+1,n+1}|$ is the height of $\mathbf{v}_{n+1}$ above that space. Also recall that $|\det \mathbf{V}_n|$ is the volume of the
parallelepiped in $n$ dimensions and therefore represents the base area in $n + 1$ dimensions.
Hence $|\det \mathbf{V}_{n+1}| = |v_{n+1,n+1}|\,|\det \mathbf{V}_n|$ is indeed height times base area and the proof is
complete.

Readers familiar with Hadamard’s inequality and the Gram-Schmidt orthogonalization 
procedure can furnish a faster, more direct, proof that avoids induction, but is less intuitive. 
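The "height times base area" picture can also be checked numerically. The following sketch, offered only as an illustration, multiplies the successive heights obtained by Gram-Schmidt orthogonalization and compares the product with the magnitude of the determinant. The function names `det` and `volume_by_heights` and the three test vectors are hypothetical choices of ours, not part of the text.

```python
import math

def det(m):
    """Determinant by Laplace expansion along the first row (adequate for small n)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def volume_by_heights(vectors):
    """Product of heights: each height is the length of the component of a
    vector orthogonal to the span of the vectors processed before it."""
    basis = []      # orthonormal basis of the span built so far
    volume = 1.0
    for v in vectors:
        residual = list(v)
        for q in basis:
            proj = sum(r * qi for r, qi in zip(residual, q))
            residual = [r - proj * qi for r, qi in zip(residual, q)]
        height = math.sqrt(sum(r * r for r in residual))
        volume *= height
        if height > 0.0:
            basis.append([r / height for r in residual])
    return volume

# Three spanning vectors in E^3 (assumed example values).
vectors = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]
# The matrix V_3 = [v1 v2 v3] has the vectors as its columns.
V3 = [[vectors[c][r] for c in range(3)] for r in range(3)]

print(abs(det(V3)), volume_by_heights(vectors))   # both should be 13 (up to rounding)
```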


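Example C.3-1
In Chapter 5 we considered the transformation

$$\begin{aligned} y_1 &= g_1(x_1, x_2, \ldots, x_n) \\ y_2 &= g_2(x_1, x_2, \ldots, x_n) \\ &\;\;\vdots \\ y_n &= g_n(x_1, x_2, \ldots, x_n) \end{aligned}$$

with unique inverse

$$\begin{aligned} x_1 &= \phi_1(y_1, y_2, \ldots, y_n) \\ x_2 &= \phi_2(y_1, y_2, \ldots, y_n) \\ &\;\;\vdots \\ x_n &= \phi_n(y_1, y_2, \ldots, y_n). \end{aligned}$$

Then, a rectangular parallelepiped in the $(y_1, y_2, \ldots, y_n)$ system with volume $\prod_{i=1}^{n} |dy_i|$
maps into a parallelepiped in the $(x_1, x_2, \ldots, x_n)$ system with volume $|\det \mathbf{V}_n| =
|\det[\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n]|$. Here, by computing the differentials of the transformation, we obtain
for the $\mathbf{v}_i$, $i = 1, \ldots, n$:

$$\mathbf{v}_i = \left(\frac{\partial\phi_1}{\partial y_i}\,dy_i, \ldots, \frac{\partial\phi_n}{\partial y_i}\,dy_i\right)^T, \quad i = 1, \ldots, n.$$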


Appendix D: Measure and Probability





D.1 INTRODUCTION AND BASIC IDEAS 


Some mathematicians describe probability theory as a special case of measure theory. 
Indeed, random variables are said to be measurable functions; the distribution function 
is said to be a measure; events are measurable sets; the sample description space together 
with the field of events is a measurable space; and a probability space is a measure space. 
In this appendix, we furnish some results for readers not familiar with the basic ideas of 
measure theory. We assume that the reader has read Chapter 1 and is familiar with set 
operations, fields, and sigma fields. The bulk of the material in this appendix is adapted 
from the classic work by Billingsley.†

Let $\Omega$ be a space (a universal set) and let $A, B, C, \ldots$ be elements (subsets) of $\Omega$. Also,
as in the text, let $\phi$ denote the empty set. Let $\mathcal{F}$ be a field of sets on $\Omega$. Then the pair
$(\Omega, \mathcal{F})$ is a measurable space if $\mathcal{F}$ is a $\sigma$-field on $\Omega$. Let $\mu$ be a set function‡ on $\mathcal{F}$. Then $\mu$
is a measure if it satisfies these conditions:


(i) if $A \in \mathcal{F}$, then $\mu[A] \in [0, \infty]$;
(ii) $\mu[\phi] = 0$;


† Patrick Billingsley, Probability and Measure. New York: John Wiley & Sons, 1978.
‡ A set function is a real-valued function defined on the field $\mathcal{F}$ of subsets of the space $\Omega$.
(iii) if $A_1, A_2, \ldots$ is a disjoint sequence of sets in $\mathcal{F}$ and if $\bigcup_{k=1}^{\infty} A_k \in \mathcal{F}$, then

$$\mu\left[\bigcup_{k=1}^{\infty} A_k\right] = \sum_{k=1}^{\infty} \mu[A_k].$$

This property is called countable additivity. A measure $\mu$ is called finite if $\mu[\Omega] < \infty$;
it is infinite if $\mu[\Omega] = \infty$. It qualifies as a probability measure if $\mu[\Omega] = 1$, as denoted in
Chapter 1. If $\mathcal{F}$ is a $\sigma$-field in $\Omega$, the triplet $(\Omega, \mathcal{F}, \mu)$ is a measure space.

Countable additivity implies finite additivity, that is,

$$\mu\left[\bigcup_{k=1}^{n} A_k\right] = \sum_{k=1}^{n} \mu[A_k]$$

if the sets are disjoint. A measure $\mu$ is monotone, that is, $\mu[A] \le \mu[B]$ whenever $A \subset B$.


The proof of this statement is straightforward. Write, as is customary in the literature on
measure theory, $BA^c \triangleq B - A$ and $B = (B - A) \cup AB = (B - A) \cup A$. Then, since $A$
and $B - A$ are disjoint, it follows that $\mu[B] = \mu[B - A] + \mu[A] \ge \mu[A]$. Also, since $A \cup B =
(A - B) \cup (B - A) \cup AB$, it follows that $\mu[A \cup B] = \mu[A - B] + \mu[B - A] + \mu[AB]$. This
result can be extended to many sets in a $\sigma$-field $\mathcal{F}$ (sets in $\mathcal{F}$ are called $\mathcal{F}$-sets), that is,

$$\mu\left[\bigcup_{k=1}^{n} A_k\right] = \sum_{i} \mu[A_i] - \sum_{i<j} \mu[A_i A_j] + \cdots + (-1)^{n+1}\mu[A_1 A_2 \cdots A_n].$$

Of course, this equation makes sense only if the sets have finite measure. It is also
straightforward to show that $\mu[\cdot]$ has the property of subadditivity:

$$\mu\left[\bigcup_{k=1}^{n} A_k\right] \le \sum_{k=1}^{n} \mu[A_k].$$
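A small finite example may make monotonicity, the inclusion-exclusion extension, and subadditivity concrete. The sketch below uses the counting measure on finite sets of integers; this measure and the three sets are our own illustrative assumptions and do not appear in the text.

```python
from itertools import combinations

def mu(s):
    """Counting measure: the number of points in a finite set."""
    return len(s)

# Three overlapping sets (assumed example values).
A = [{1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}]
union = set().union(*A)

# Inclusion-exclusion: alternate signs over intersections of r sets at a time.
incl_excl = 0
for r in range(1, len(A) + 1):
    for group in combinations(A, r):
        incl_excl += (-1) ** (r + 1) * mu(set.intersection(*group))

sum_of_measures = sum(mu(a) for a in A)

print(mu(union), incl_excl, sum_of_measures)   # 7 7 11
print(mu(A[0] & A[1]) <= mu(A[0]))             # monotonicity: True
```

The union has measure 7, which inclusion-exclusion reproduces exactly, while the plain sum of measures (11) overcounts the overlaps, as subadditivity allows.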





Example D.1-1
Lebesgue measure. Consider the $\sigma$-field, $\mathcal{F}$, of intervals on $\Omega = (0,1]$. The elements of $\mathcal{F}$ are
called linear Borel sets and the $\sigma$-field of intervals is called the Borel field $\mathcal{B}$. We shall use
this notation for any $\sigma$-field on the real line. A measure $\mu[\cdot]$ on $\mathcal{F}$ is $\mu(a, b] \triangleq b - a$, where
$b > a$. This measure is called the Lebesgue measure on $(a, b]$. It can be directly generalized
to the real line $R^1$. An extension of the Lebesgue measure to $k$-dimensional Euclidean
space is:

$$\mu = \lambda_k[\mathbf{x} : a_i < x_i \le b_i,\ i = 1, \ldots, k] \triangleq \prod_{i=1}^{k} (b_i - a_i).$$

Thus, the Lebesgue measures are length ($k = 1$), area ($k = 2$), volume ($k = 3$), and
hypervolume ($k > 3$). We denote the associated $\sigma$-field generated by these generalized rectangles
by the symbol $\mathcal{B}^k$.
There are many important theorems regarding measures. We cite several below.


Theorem D.1-1 (Translation invariance.) Let $A \in \mathcal{B}^k$ and define $A + \mathbf{x} \triangleq \{\mathbf{a} + \mathbf{x} :
\mathbf{a} \in A\}$. Then $\lambda_k(A + \mathbf{x}) = \lambda_k(A)$ for all translation vectors $\mathbf{x}$. ∎


Theorem D.1-2 (Lebesgue measure of transformation.) Let $T : R^k \to R^k$ denote a
linear and nonsingular transformation from the Euclidean space $R^k$ to $R^k$. Then $A \in \mathcal{B}^k$
implies that $TA \in \mathcal{B}^k$ and $\lambda_k(TA) = |\det T| \cdot \lambda_k(A)$. For example, if $T$ is a rotation
or reflection, that is, an orthogonal or unitary transformation, then $|\det T| = 1$ and
$\lambda_k(TA) = \lambda_k(A)$. ∎


Theorem D.1-3 (Lebesgue measure of subspaces of $R^k$.) Every $(k - 1)$-dimensional
hyperplane has $k$-dimensional Lebesgue measure zero. ∎


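Theorem D.1-4 (Continuity of measure.) (i) Let $\mu$ be a measure on a field $\mathcal{F}$. Then
if $A_n$ and $A$ lie in $\mathcal{F}$ and $A_n \uparrow A$, then $\mu[A_n] \uparrow \mu[A]$. This is called continuity of measure
from below. $A_n \uparrow A$ means that $A_{n-1} \subset A_n \subset A_{n+1} \subset \cdots$ and

$$A = \bigcup_{n=1}^{\infty} A_n.$$

Likewise, $\mu[A_n] \uparrow \mu[A]$ means that $\mu[A_n] \le \mu[A_{n+1}] \le \mu[A]$ and $\lim \mu[A_n] = \mu[A]$.

(ii) Let $\mu$ be a measure on a field $\mathcal{F}$. Then if $A_n$ and $A$ lie in $\mathcal{F}$ and $A_n \downarrow A$, then
$\mu[A_n] \downarrow \mu[A]$. This is called continuity of measure from above. $A_n \downarrow A$ means that $A_{n-1} \supset
A_n \supset A_{n+1} \supset \cdots$ and

$$A = \bigcap_{n=1}^{\infty} A_n.$$

Likewise, $\mu[A_n] \downarrow \mu[A]$ means that $\mu[A_n] \ge \mu[A_{n+1}] \ge \mu[A]$ and $\lim \mu[A_n] = \mu[A]$. ∎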


Measurable Mappings and Functions 


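Let $(\Omega, \mathcal{F})$ and $(\Omega', \mathcal{F}')$ be two measurable spaces with two sets $A \in \mathcal{F}$ and $A' \in \mathcal{F}'$. For a
mapping $T : \Omega \to \Omega'$, consider the inverse image $T^{-1}A' = \{\omega \in \Omega : T\omega \in A'\}$ for $A' \subset \Omega'$.
The mapping is measurable if $T^{-1}A' \in \mathcal{F}$ for every $A' \in \mathcal{F}'$. For example, consider the unit
interval $\Omega = (0,1]$ with $\mathcal{F} = \mathcal{B}$ and the mapping $Tx = x^2$. Here, $\Omega' = \Omega$ and $\mathcal{F}' = \mathcal{B}$.
Clearly, the inverse image of every Borel interval in $\Omega'$ is a Borel interval in $\Omega$. Hence, $T$ is
a measurable mapping.

A real function $X$ on $\Omega$, with image space $R^1$, is said to be measurable if its inverse
image $X^{-1}B = \{\omega : X(\omega) \in B\} \in \mathcal{F}$ for every $B \in \mathcal{B}$.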


D.2 APPLICATION OF MEASURE THEORY TO PROBABILITY 


A set function $P$ on a $\sigma$-field $\mathcal{F}$ is a probability measure if:


(i) $0 \le P[A] \le 1$ for every $A \in \mathcal{F}$;
(ii) $P[\phi] = 0$, $P[\Omega] = 1$;
(iii) if $A_1, A_2, \ldots, A_k, \ldots$ is a disjoint sequence of $\mathcal{F}$-sets such that
$\bigcup_{k=1}^{\infty} A_k \in \mathcal{F}$, then

$$P\left[\bigcup_{k=1}^{\infty} A_k\right] = \sum_{k=1}^{\infty} P[A_k].$$

(This is the countable additivity property of the probability measure.)


Distribution Measure 


In keeping with the notation in the main text, we replace $\omega$ with $\zeta$ to denote the elements of
$\Omega$. Recall that this was done to save $\omega$ for the Fourier transform variable needed throughout
the text. Let $B \in \mathcal{B}$, the Borel $\sigma$-field of intervals on the real line. Consider a (probability)
measure $\mu$ on $(R^1, \mathcal{B})$ defined by $\mu[B] \triangleq P[\zeta : X(\zeta) \in B] = P_X[B]$. This measure is called
the distribution or law of a random variable. The distribution function of $X$ is defined by

$$F_X(x) \triangleq \mu(-\infty, x] = P[X \le x],$$

where $P[X \le x]$ is short for $P[\zeta : X(\zeta) \le x]$. By the continuity-from-above part of the
continuity of measure theorem, $F_X(x)$ is continuous from the right.

Since the field of events is a o-field, and the distribution function is generated by a 
measure, all of the properties of measures apply in probability. It is for this reason that 
probability and measure theories are so closely related. However, to look at probability 
theory just from the point of view of measure theory is to ignore its rich calculus which 
enables the solution of engineering, scientific, and statistical problems. 





Appendix E: Sampled Analog Waveforms and Discrete-time Signals


Discrete-time signals are often realized by sampling continuous-time analog waveforms.
Here, we briefly review the relationship between the two types of signals. The reconstruction
of a continuous-time signal from its equally spaced samples is governed by the famous
Whittaker-Nyquist-Shannon sampling theorem, which states the following.


Theorem E.1-1 A continuous signal $x(t)$ with real frequencies no higher than $\nu_{\max}$
can be reconstructed exactly from its samples $x(nT)$ if the sampling interval $T$ satisfies
$T < 1/(2\nu_{\max})$.

The proof of this important theorem is given in many places, for example, Principles of
Communication Engineering by John M. Wozencraft and Irwin M. Jacobs, John Wiley and
Sons, NY, 1965. Let $x(t)$, $y(t)$, and $h(t)$ denote the input signal, output signal, and impulse
response of a linear, shift-invariant (LSI) system, respectively. Let $B$, in Hertz, denote a
bandwidth that is greater than any signal or system bandwidth encountered in the system
and let $\Delta \triangleq 1/(2B)$. For ease of notation define

$$\mathrm{sinc}(x) \triangleq \frac{\sin \pi x}{\pi x}.$$

The relationship between input and output for an LSI system is

$$y(t) = \int_{-\infty}^{\infty} h(s)\,x(t - s)\,ds$$
and from the sampling theorem:

$$y(t) = \sum_{l} y(l\Delta)\,\mathrm{sinc}(2B[t - l\Delta]),$$

$$x(t) = \sum_{l} x(l\Delta)\,\mathrm{sinc}(2B[t - l\Delta]),$$

$$h(t) = \sum_{l} h(l\Delta)\,\mathrm{sinc}(2B[t - l\Delta]).$$

If we insert these expansions into the input-output integral, being careful to use
different subscripts, and evaluate $y(t)$ at $t = l\Delta$, we obtain

$$y(l\Delta) = \sum_{m}\sum_{n} h(n\Delta)\,x(m\Delta)\,I(l, m, n),$$

where

$$I(l, m, n) \triangleq \int_{-\infty}^{\infty} \mathrm{sinc}(2B[s - n\Delta])\,\mathrm{sinc}(2B[s - (l - m)\Delta])\,ds = 0,$$

for all integers $l, m, n$ except when $l - m = n$, whereupon it assumes the value $\Delta$. Hence,
we obtain the important result that

$$y(l\Delta) = \sum_{n} h(n\Delta)\,x([l - n]\Delta)\,\Delta.$$

Often the factor $\Delta$ is submerged into $h(n\Delta)$. In a computer the sampled values of
the functions become mere sequences of numbers as $y(l\Delta) \triangleq y[l]$, $x(l\Delta) \triangleq x[l]$, and
$h(n\Delta) \triangleq h[n]$. Then, we obtain

$$y[l] = \sum_{n} h[n]\,x[l - n],$$

which we recognize as a discrete convolution. The important fact to remember is that the
processing of analog signals can be done by operating on their samples and then
reconstructing an analog waveform by filtering.
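As a minimal sketch of the discrete convolution for finite-length sequences, consider the following Python fragment. The function name `discrete_convolution`, the optional `delta` scale factor (which stands in for the sampling-interval factor that is often submerged into the impulse response), and the sample values are hypothetical choices made only for illustration.

```python
def discrete_convolution(h, x, delta=1.0):
    """Return y[l] = delta * sum_n h[n] * x[l - n] for finite sequences
    that both start at index 0 (values outside the lists are taken as zero)."""
    y = [0.0] * (len(h) + len(x) - 1)
    for n, hn in enumerate(h):
        for m, xm in enumerate(x):
            y[n + m] += delta * hn * xm   # term h[n] x[m] contributes to l = n + m
    return y

h = [1.0, 0.5]        # assumed impulse-response samples
x = [1.0, 2.0, 3.0]   # assumed input samples
y = discrete_convolution(h, x)
print(y)              # [1.0, 2.5, 4.0, 1.5]
```

With `delta` left at 1, the factor is treated as already absorbed into `h`; passing the actual sampling interval instead rescales every output sample by that amount.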

Another point to consider is that the sequence of numbers $\{x[n]\}$ does not contain
information about the sampling period. For example, consider the sinusoid $x(t) = A\cos(\omega_a t + \theta)$.
If we sample at $t = n\Delta$, $n = \ldots, -2, -1, 0, 1, 2, \ldots$, we obtain the samples $x(n\Delta) =
A\cos(n\Delta\omega_a + \theta) = A\cos(n\omega + \theta) \triangleq x[n]$, where $\omega \triangleq \Delta\omega_a$. The radian “frequency” $\omega$ is
dimensionless, which is consistent with the dimensionless “time” $n$. It is well to remember
that to convert to analog frequencies $\omega_a$ (radians/sec) or $\nu_a$ (Hertz) we must use $\omega_a = \omega/\Delta$
or $\nu_a = \nu/\Delta$. For example, the Fourier transform of a sequence of numbers $\{x[n]\}$ will yield
a spectrum of sinusoids at normalized frequencies $\omega$ that lie in the interval $[-\pi, \pi]$. If we
convert to analog radian frequencies, then the spectrum will lie in the interval $[-2\pi B, 2\pi B]$.
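The point that a sample sequence carries no record of its sampling period can be illustrated with a short sketch. Two sinusoids with different analog frequencies and different sampling intervals, chosen so that the product of interval and analog radian frequency is the same, give the same sequence. The function names and numerical values below are hypothetical and serve only as an illustration.

```python
import math

def sample_sinusoid(amplitude, omega_a, theta, delta, count):
    """Samples amplitude * cos(n * delta * omega_a + theta) for n = 0, ..., count - 1."""
    return [amplitude * math.cos(n * delta * omega_a + theta) for n in range(count)]

def to_analog_radian_frequency(omega, delta):
    """Convert a normalized (dimensionless) radian frequency to radians/sec."""
    return omega / delta

# 100 Hz sampled every 1 ms and 50 Hz sampled every 2 ms (assumed values):
# both have normalized frequency 0.2 * pi.
seq_fast = sample_sinusoid(1.0, 2 * math.pi * 100.0, 0.3, 1e-3, 16)
seq_slow = sample_sinusoid(1.0, 2 * math.pi * 50.0, 0.3, 2e-3, 16)
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(seq_fast, seq_slow))
print(same)   # True: the two sequences are indistinguishable

# The band edge omega = pi maps to 2 * pi * B when delta = 1 / (2B).
delta = 1e-3
B = 1.0 / (2.0 * delta)
edge = to_analog_radian_frequency(math.pi, delta)
print(math.isclose(edge, 2 * math.pi * B))   # True
```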





Appendix F: Independence of Sample Mean and Variance for Normal Random Variables†


Of all the distributions we encounter in probability and statistics, without doubt, the Normal
(Gaussian) distribution is of greatest importance. There are a number of reasons for this,
but first and foremost is the Central Limit Theorem (CLT), which states that under a set
of reasonable and realistic conditions the sum of a large number of independent random
variables tends to have a Normal CDF. This property enables us to solve many problems in
statistics by invoking the CLT when the sample size is large. Readers of Chapters 6 and 7
will have noticed that we use the CLT to generate results that otherwise would have been
difficult to obtain.

There are other reasons why the Normal distribution plays such an important role in
probability and statistics. One of them is that the univariate Normal pdf has two parameters
that are algebraically independent, that is, within their range they can have any arbitrary
values without conflicting with each other. The mean $\mu$ can have any value in $(-\infty, \infty)$ and
the variance $\sigma^2$ can have any value in $(0, \infty)$. This suggests that we can always design a
generator of Normal data that will have a specified mean and variance. The same is true
for the multivariate, that is, multidimensional, Normal distribution. That is, given a mean
vector and covariance matrix, respectively, $\boldsymbol{\mu}$, $\mathbf{K}$, we can always design a Normal generator
whose data will have these parameters. The Normal pdf also enjoys completeness, a property
of importance in finding a class of optimum estimators called minimum variance, unbiased
estimators.

Given the importance of the Normal distribution, the estimation of its parameters
$\mu$ and $\sigma^2$ is a central problem in statistics. Assume that we make $n$ i.i.d. observations
on $X: N(\mu, \sigma^2)$. We estimate $\mu$ with $(1/n)\sum_{i=1}^{n} X_i$ (sample mean) and $\sigma^2$ with
$(1/n)\sum_{i=1}^{n}\left(X_i - (1/n)\sum_{j=1}^{n} X_j\right)^2$ or $(1/(n-1))\sum_{i=1}^{n}\left(X_i - (1/n)\sum_{j=1}^{n} X_j\right)^2$ (sample
variance). We note that both the sample mean and sample variance use the same data.
Remarkably, the sample mean and sample variance are statistically independent†. In
proving this result we shall use a theorem from probability theory: If the joint
moment-generating function of two random variables $V$ and $W$, say $M_{VW}(t_1, t_2)$, factors as $M_V(t_1)
M_W(t_2)$, then $V$ and $W$ are independent. This result was derived in Example 4.7-1 for
characteristic functions, i.e., moment-generating functions evaluated at $t = j\omega$.

†The proof substantially follows that given in [7-1].

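The two random variables of interest are the estimators $\hat{\mu}_X$ and $\hat{\sigma}_X^2$. For simplicity and
to keep the algebra to a minimum, we define

(i) $Y_i \triangleq \dfrac{X_i - \mu}{\sigma}$, $i = 1, \ldots, n$;
(ii) $V \triangleq \left(\dfrac{1}{\sqrt{n}}\sum_{i=1}^{n} Y_i\right)^2 = n\hat{\mu}_Y^2$;
(iii) $W \triangleq \sum_{i=1}^{n} (Y_i - \hat{\mu}_Y)^2 = (n - 1)\hat{\sigma}_Y^2$.

We note in passing that $V: \chi_1^2$ and $W: \chi_{n-1}^2$. Now recall that $M_{VW}(t_1, t_2)$ is given by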


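$$M_{VW}(t_1, t_2) = E[\exp(t_1 V + t_2 W)] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (2\pi)^{-n/2}\exp\left(-\tfrac{1}{2}Q\right)\prod_{i=1}^{n} dy_i,$$

where

$$Q \triangleq \sum_{i=1}^{n} y_i^2 - \frac{2t_1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 - 2t_2\sum_{i=1}^{n}\left(y_i - \frac{1}{n}\sum_{j=1}^{n} y_j\right)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} r_{ij}\,y_i y_j = \mathbf{y}^T\mathbf{R}^{-1}\mathbf{y},$$

where $\mathbf{R}$ is a covariance matrix whose inverse $\mathbf{R}^{-1}$ has diagonal elements $r_{ii}$ and
off-diagonal elements $r_{ij}$, $i \neq j$, where

$$r_{ii} = 1 - 2t_2 - \frac{2(t_1 - t_2)}{n}, \quad i = 1, \ldots, n, \quad \text{diagonal terms of } \mathbf{R}^{-1} \tag{F-1a}$$

$$r_{ij} = -\frac{2(t_1 - t_2)}{n}, \quad i, j = 1, \ldots, n;\ i \neq j, \quad \text{off-diagonal terms of } \mathbf{R}^{-1}. \tag{F-1b}$$

Recalling that the multidimensional Normal pdf is written as

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{n/2}|\mathbf{R}|^{1/2}}\exp\left(-\tfrac{1}{2}\mathbf{y}^T\mathbf{R}^{-1}\mathbf{y}\right)$$

and that $\int_{-\infty}^{\infty} f_{\mathbf{Y}}(\mathbf{y})\,d\mathbf{y} = 1$, we conclude that

$$M_{VW}(t_1, t_2) = E(\exp(t_1 V + t_2 W)) = |\mathbf{R}|^{1/2}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\left[\frac{1}{(2\pi)^{n/2}|\mathbf{R}|^{1/2}}\exp\left(-\tfrac{1}{2}Q\right)\right] dy_1\,dy_2\cdots dy_n = |\mathbf{R}|^{1/2}.$$

From matrix theory, it is known that for any $n \times n$ matrix with diagonal elements $a$
and off-diagonal elements $b$, the determinant is $(a - b)^{n-1}(a + (n - 1)b)$.
Applying this to $\mathbf{R}^{-1}$ with $a \triangleq r_{ii}$, $b \triangleq r_{ij}$ (from Equation F-1), and using $|\mathbf{R}| = 1/|\mathbf{R}^{-1}|$, we obtain

$$M_{VW}(t_1, t_2) = (1 - 2t_1)^{-1/2}(1 - 2t_2)^{-(n-1)/2}, \quad t_1 < 1/2,\ t_2 < 1/2,$$
$$= M_V(t_1) \times M_W(t_2).$$

† Independent or independence is meant in a statistical sense. Else we use algebraic or functional
independence.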


Hence, by the theorem quoted at the beginning of the discussion, we conclude that
$V$ and $W$ are independent. Hence $F_{VW}(v, w) = F_V(v)F_W(w)$ and therefore $\hat{\mu}_Y$ and
$\hat{\sigma}_Y^2$ are independent. It can be shown that if $\hat{\mu}_Y$ and $\hat{\sigma}_Y^2$ are independent then so are $\hat{\mu}_X$
and $\hat{\sigma}_X^2$. This important result enables us to select separate confidence intervals for $\hat{\mu}_X$
and $\hat{\sigma}_X^2$ without fear of contradiction. The independence of $\hat{\mu}_X$ and $\hat{\sigma}_X^2$ is true only in the
Normal case. ∎
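A simulation cannot prove independence, but it can illustrate one of its consequences: the sample mean and sample variance of Normal data should be uncorrelated, whereas for a skewed population they are visibly correlated. The Monte Carlo sketch below is a hedged illustration only; the sample size, number of trials, seed, and the exponential comparison population are our own assumptions, and zero correlation is a necessary, not a sufficient, condition for independence.

```python
import math
import random

def pearson(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mean_variance_correlation(draw, n=5, trials=20000, seed=1):
    """Correlation between the sample mean and the (n - 1)-divisor sample
    variance over many independent samples of size n."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(trials):
        sample = [draw(rng) for _ in range(n)]
        m = sum(sample) / n
        s2 = sum((x - m) ** 2 for x in sample) / (n - 1)
        means.append(m)
        variances.append(s2)
    return pearson(means, variances)

corr_normal = mean_variance_correlation(lambda rng: rng.gauss(0.0, 1.0))
corr_expo = mean_variance_correlation(lambda rng: rng.expovariate(1.0))
print(corr_normal)   # close to zero for Normal data
print(corr_expo)     # clearly positive for exponential data
```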


Appendix G: Tables of Cumulative Distribution Functions: the Normal, Student t, Chi-square, and F





In the following pages we present tables of the CDF of the (1) Normal; (2) Student-t; (3)
Chi-square; and (4) F, the latter sometimes called the Snedecor F distribution.

The gamma function $\Gamma(a) = \int_0^{\infty} x^{a-1}e^{-x}\,dx$, $a > 0$, appears in several of the CDFs
below. When $a$ is an integer, say, $a = m \ge 1$, then $\Gamma(m) = [m - 1]! = (m - 1) \times (m - 2) \times
\cdots \times 2 \times 1$. Note $0! = 1$. Next to each CDF are a few of its applications.
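The factorial property of the gamma function is easy to confirm with the Python standard library, as in the short sketch below. The half-integer value $\Gamma(1/2) = \sqrt{\pi}$ is a standard fact added here for illustration (it arises when the degrees of freedom are odd); the variable names are hypothetical.

```python
import math

# Gamma(m) = (m - 1)! for integer m >= 1.
factorial_match = all(
    math.isclose(math.gamma(m), math.factorial(m - 1)) for m in range(1, 11)
)
print(factorial_match)             # True

# Half-integer arguments arise for odd degrees of freedom.
gamma_half = math.gamma(0.5)
print(gamma_half, math.sqrt(math.pi))   # both about 1.7724538509
```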

(1) Standard Normal (extensively used in probability and statistics)

$$F_{SN}(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} \exp\left(-\frac{y^2}{2}\right) dy$$

The general univariate Normal CDF is a function of two parameters: the mean $\mu$ and the
variance $\sigma^2$.


(2) Student-t (interval estimation, tests on the means of Normal populations,
$\mu = \mu_0$ versus $\mu \neq \mu_0$)

$$F_T(x; m) = \frac{\Gamma([m+1]/2)}{\sqrt{\pi m}\,\Gamma(m/2)}\int_{-\infty}^{x}\left(1 + \frac{y^2}{m}\right)^{-(m+1)/2} dy$$

The Student-t distribution is a function of the parameter $m$ called the degrees of freedom
(DOF). It is a special case of the F-distribution.
(3) Chi-square (confidence intervals for variance of Normal populations,
testing $\sigma^2 = \sigma_0^2$ versus $\sigma^2 \neq \sigma_0^2$, Pearson’s goodness-of-fit)

$$F_{\chi^2}(x; m) = K'\int_0^{x} y^{(m/2)-1}\exp\left(-\frac{y}{2}\right) dy, \qquad K' \triangleq \frac{1}{2^{m/2}\,\Gamma(m/2)}$$

The Chi-square CDF is a function of the parameter $m$ called the degrees of freedom (DOF).


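(4) Snedecor F (generalized likelihood ratio, testing $\sigma_1^2 = \sigma_2^2$ versus $\sigma_1^2 \neq \sigma_2^2$)

$$F_F(x; m, n) = K''\int_0^{x} y^{(m/2)-1}\left(1 + \frac{m}{n}\,y\right)^{-(m+n)/2} dy, \qquad K'' \triangleq \frac{\Gamma\left(\dfrac{m+n}{2}\right)}{\Gamma\left(\dfrac{m}{2}\right)\Gamma\left(\dfrac{n}{2}\right)}\left(\frac{m}{n}\right)^{m/2}$$

The Snedecor F CDF is a function of two parameters $m$ and $n$. These are called the degrees
of freedom (DOF) of the F-distribution. When referring to the DOF, the parameter $m$ is
quoted first.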
Table 1 Standard Normal CDF


$F_{SN}(x)$ is the table entry. First digit of $x$ gives the row, and second digit of $x$ gives the position in
the row.
Table 2 Student-t CDF


For each F(x;n) given across the top of the table, row n then determines the table entry, the
corresponding value of x.


F(x;n)
n 0.60 0.75 0.90 0.95 0.975 0.99 0.995 0.9995
1 0.325 1.000 3.078 6.314 12.706 31.821 63.657 636.619 
2 0.289 0.816 1.886 2.920 4.303 6.965 9.925 31.598 
3 0.277 0.765 1.638 2.353 3.182 4.541 5.841 12.924 
4 0.271 0.741 1.533 2.132 2.776 3.747 4.604 8.610 
5 0.267 0.727 1.476 2.015 2.571 3.365 4.032 6.869 
6 0.265 0.718 1.440 1.943 2.447 3.143 3.707 5.959 
7 0.263 0.711 1.415 1.895 2.365 2.998 3.499 5.408 
8 0.262 0.706 1.397 1.860 2.306 2.896 3.355 5.041 
9 0.261 0.703 1.383 1.833 2.262 2.821 3.250 4.781 
10 0.260 0.700 1.372 1.812 2.228 2.764 3.169 4.587 
11 0.260 0.697 1.363 1.796 2.201 2.718 3.106 4.437 
12 0.259 0.695 1.356 1.782 2.179 2.681 3.055 4.318 
13 0.259 0.694 1.350 1.771 2.160 2.650 3.012 4.221 
14 0.258 0.692 1.345 1.761 2.145 2.624 2.977 4.140 
15 0.258 0.691 1.341 1.753 2.131 2.602 2.947 4.073 
16 0.258 0.690 1.337 1.746 2.120 2.583 2.921 4.015 
17 0.257 0.689 1.333 1.740 2.110 2.567 2.898 3.965 
18 0.257 0.688 1.330 1.734 2.101 2.552 2.878 3.922 
19 0.257 0.688 1.328 1.729 2.093 2.539 2.861 3.883 
20 0.257 0.687 1.325 1.725 2.086 2.528 2.845 3.850 
21 0.257 0.686 1.323 1.721 2.080 2.518 2.831 3.819 
22 0.256 0.686 1.321 1.717 2.074 2.508 2.819 3.792 
23 0.256 0.685 1.319 1.714 2.069 2.500 2.807 3.767 
24 0.256 0.685 1.318 1.711 2.064 2.492 2.797 3.745 
25 0.256 0.684 1.316 1.708 2.060 2.485 2.787 3.725 
26 0.256 0.684 1.315 1.706 2.056 2.479 2.779 3.707 
27 0.256 0.684 1.314 1.703 2.052 2.473 2.771 3.690 
28 0.256 0.683 1.313 1.701 2.048 2.467 2.763 3.674 
29 0.256 0.683 1.311 1.699 2.045 2.462 2.756 3.659 
30 0.256 0.683 1.310 1.697 2.042 2.457 2.750 3.646 
40 0.255 0.681 1.303 1.684 2.021 2.423 2.704 3.551 
60 0.254 0.679 1.296 1.671 2.000 2.390 2.660 3.460 
120 0.254 0.677 1.289 1.658 1.980 2.358 2.617 3.373 
∞ 0.253 0.674 1.282 1.645 1.960 2.326 2.576 3.291





Adapted from W.H. Beyer, Ed., in CRC Handbook of Tables for Probability and Statistics, 2d ed., The 
Chemical Rubber Co., Cleveland, 1968; p. 283. With permission of CRC Press, Inc. 
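Entries of Table 2 can be spot-checked by integrating the Student-t pdf numerically. The sketch below does this with a composite Simpson rule and assumes the standard form of the Student-t density; the function names and the number of integration steps are hypothetical choices, and the agreement is only to about the three decimals printed in the table.

```python
import math

def student_t_pdf(y, m):
    """Student-t pdf with m degrees of freedom (standard form)."""
    coeff = math.gamma((m + 1) / 2) / (math.sqrt(math.pi * m) * math.gamma(m / 2))
    return coeff * (1 + y * y / m) ** (-(m + 1) / 2)

def student_t_cdf(x, m, steps=2000):
    """CDF as 0.5 plus the integral of the pdf from 0 to x (the pdf is
    symmetric about zero), using composite Simpson with an even step count."""
    if x == 0:
        return 0.5
    width = x / steps
    total = student_t_pdf(0.0, m) + student_t_pdf(x, m)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * student_t_pdf(k * width, m)
    return 0.5 + total * width / 3

print(student_t_cdf(2.228, 10))   # about 0.975 (row n = 10, column 0.975)
print(student_t_cdf(1.812, 10))   # about 0.95  (row n = 10, column 0.95)
print(student_t_cdf(6.314, 1))    # about 0.95  (row n = 1,  column 0.95)
```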
a priori, 15

a posteriori probability, 48—49, 
123, 125 

a priori probability, 48, 378, 
406—407, 445 

additive noise, 184, 217 

adjacent-sample difference, 
107-108 

adjoint operator, 587 

almost-diagonal covariance 
matrix 

dependent random variables 

example, 325 

almost sure convergence 

defined, 527 

alternative derivation of Poisson 
process, 567—568 

alternative hypothesis, 402 

analytic continuation, 506 

applied probability, 352 

arrival times, 470 

asymmetric Markov chain 
(AMC), 518-519 

asymmetric two-state Markov 
chain example, 518-519 

asymptotically stationary 
autocorrelation (ASA) 
function, 520 

asymptotically WSS, 497 


asynchronous binary signaling 
(ABS) process, 560-562 
autocorrelation functions, 
468 
ABS, 562 
RTS, 599 
WSS properties, 592 
autocorrelation impulse response 
(AIR), 500, 595 
autocorrelation matrix, 324 
autocovariance function, 569 
autoregression, 482, 515 
autoregressive moving average 
(ARMA), 515 
average power in frequency band 
theorem, 606-607 
average probability, 123 
axiomatic definition of 
probability, 27-32 
axiomatic theory, 17 


B 

Bayes, Thomas, 47 

Bayes’ formula for probability 
density functions, 123 

Bayesian decision theory, 
403-407 

Bayes strategy, 405-407 

Bayes’ theorem proof, 47—49 

Bernoulli PMF, 114 


Bernoulli random sequence, 
459-460, 571 
Bernoulli RV, 356-357, 378-381 
beta pdf, 111 
example, 171 
binomial law in Bernoulli trials, 
60-69 
binomial coefficient, 52 
binomial counting sequence 
example, 534 
binomial distribution function, 
65 
binomial law asymptotic 
behavior, 69-75 
normal approximation, 
75-77 
binomial PMF, 284 
binomial random variables sum 
example, 201-202 
variance, 254 
birth-death chain, 520 
birth-death Markov chains, 
579-583 
process, 579 
Boltzmann constant, 57 
Boltzmann law, 57 
Borel field, 26 
Borel function, 230 
Borel subsets, 93, 528 
Bose-Einstein statistics, 
57-58 
bounded-input bounded-output
(BIBO), 488 
black-lung disease 
recognition of, 308 
Brownian motion, 572-575 


C
carrier signal, 570 
Cauchy, Auguste Louis, 187 
Cauchy convergence criterion, 
525-533 
Cauchy pdf, 154 
example of, 235-236 
Cauchy probability law, 187 
Cauchy-Schwarz inequality, 560 
Cauchy sequence of measurable 
functions, 527 
causal probability, 48 
CDF, see cumulative distribution 
function (CDF) 
centered Poisson process, 603 
centered process, 558, 586 
centered random sequence, 469 
central limit theorem (CLT), 
284, 288-293, 353, 433, 
461, 573 
example, 293 
central moment 
defined, 254 
certain event, 20 
chain rule of probability, 512 
change detector 
example, 587-588 
see also edge detector 
Chapman-Kolmogorov equations, 
283-584 
characteristic equation, 485-486, 
521 
characteristic function (CF), 
278-293 
Normal law, 343-344 
proof, 288 
random vectors, 340-343 
Chebyshev, Pafnuti L., 267 
Chebyshev’s inequality, 267-273, 
362, 530 
Chernoff bound, 273, 276-278 
Chi-square pdf, 109-110 
example, 239-242 
with n degrees-of-freedom, 240 
closed intervals, 92 
collection of realizations, 454 
column vector, 326-328 


combinatorics, 50-60 
communications 
examples, 41, 48, 158, 164, 
204-205, 252 
complement, 22 
complex random sequence, 468 
composite hypotheses, 414-415 
F-test, 424-427 
generalized likelihood ratio 
test (GLRT), 415-420 
test for equality of means of 
two populations, 420—424 
variance of normal population, 
424-427 
computerized tomography, 263 
paradigm, 264 
conditional CDF, 121, 122 
conditional densities, 146-148 
conditional distributions, 
119-149 
functions, 122, 310 
conditional expectations, 
244-253 
properties, 253 
as random variable, 251 
communication system 
example, 252 
conditional failure rate, 150 
conditional mean, 253 
conditional pdf, 122 
conditional expectation, 246 
linear combination, 310 
conditional probabilities, 32-38 
confidence interval estimation, 
375 
mean, 375-376 
confidence interval for median, 
440-441 
confidence intervals, 396 
conjugate symmetry property, 
593 
consistency, 368-369 
estimator, 359 
example, 467 
guaranteed, 467 
constant mean function, 477 
continuity probability measure, 
464-466 
continuous operator, 604 
continuous random variable, 
112-115 
continuous sample space random 
process, 557 


continuous system, 604 
continuous-time linear systems 
random inputs, 584-590 
continuous-time linear system 
theory, 483 
continuous-time Markov chain, 
516 
continuous-valued Markov 
process, 576 
continuous-valued Markov 
random sequences, 
512-523 
continuous-valued random 
sequence, 468 
contours of constant density, 268 
joint Gaussian pdf, 265-267 
convergence 
of deterministic sequences, 526 
of functions, 526 
in probability, 527-533 
of random sequences 
example, 528-529 
for random sequences 
Venn diagram showing, 532 
convergent sequences 
example, 526 
convolution 
integral, 190 
theorem, 488 
convolution-type problems 
example, 190-192 
coordinate transformation in 
Normal case, 267 
correlated noise, 461 
example, 460-462 
correlated samples, 337 
correlation coefficient, 146, 259 
calculating, 492 
coefficient estimate, 373 
correlation function, 476, 494, 
558 
definition of, 468, 476 
example, 500 
properties 
psd table, 597 
random sequence with memory 
example, 478-483 
correlation matrix, 324 
countable additivity axiom, 459 
countable random variables, 558 
intersections, 25 
unions, 25 
countably additive, 458 
covariance, 372-373 
covariance function, 476, 558 
recursive system 
example, 495-498 
covariance matrices, 323-330 
almost-diagonal 
example, 325 
diagonalization, 328 
properties, 326-331 
whitening transformation, 
330-331 
cross-correlation function, 586 
example of, 497 
theorem, 492-493 
WSS properties, 592 
cross-power spectral density, 502, 
604 
cumulative distribution function 
(CDF), 92, 95, 116 
computation of F_X(x), 97-100 
conditional, 120, 122 
defined, 95 
joint, 130-135, 136 
properties of, 96-97 
random vectors, 308 
random sequence, 466 
Tables of, 110, 116 
transformation of 
example, 172 
unconditional, 121, 310 
cyclostationary, 509 
processes, 612-617 
waveforms, 509 


D 
Davenport, Wilbur, 230 
decimation, 508-509 
example, 508 
decision function, 404 
deconvolution, 500 
decorrelation of random vectors 
example, 328-329 
decreasing sequence, 465 
degrees of freedom (DOF), 365 
De Moivre, Abraham, 289 
De Morgan, Augustus, 24 
De Morgan’s laws, 24 
densities of RVs, 119-149 
computation by induction, 
471 
table of CDFs, 116 
tables of means and variances, 
258 


table of pdf’s, 110 
see also pdf 
dependent random variables 
almost-diagonal covariance 
matrix 
example, 325-326 
derivative 
of quadratic forms, 393-394 
of scalar product, 394-395 
of WSS process example, 596 
deterministic sequences 
convergence, 525 
deterministic vectors, 308 
deviation from the mean for a 
Normal RV 
example, 269 
diagonal dominance, 560 
diagonalization of covariance 
matrices, 328 
simultaneous of two matrices, 
331 
see also whitening 
difference of two sets, 22 
differential equations, 608-612 
example of, 484-485 
solution of, 484-485 
digital modulation, 560 
PSK, 570 
Dirac, Paul A. M., 113 
Dirac delta functions, 113, 165, 
B-3 
see also impulse 
direct dependence, 461, 611 
concept, 514 
direct method for pdf’s, 178 
discrete convolution of PMFs, 
283-284 
discrete random vector, 114-115, 
340-343 
discrete-time Fourier transform, 
501 
discrete-time impulse, 472 
discrete-time linear systems 
principles, 483-486 
shift invariant, 489 
discrete-time Markov chains, 516 
defined, 516 
discrete-time signal, 487 
discrete-time simulation, 505 
and synthesis of sequences, 
505-508 
discrete-time systems review, 
483 
discrete-valued Markov random 
sequence, 512-515, 576 

discrete random variables, 
112-113 

distance preservation, 328 

see also unitary 

distribution-free estimation, 396 

distribution-free hypothesis 
testing, 441-444 

distribution-free/nonparametric 
statistics, 384 

distribution function, 95-100 

doubly stochastic, 125 

driven solution, 622 

Durant, John, 19 


E 
edge detector 
example of, 494-495, 587-588 
input correlation function, 
495 
using impulse response, 500 
Eigenfunctions, 488 
Eigenvalues, 326-328 
Eigenvector, 326-328, 330 
matrix, 328-329 
electric-circuit theory 
example, 163-164 
elementary events, 29 
energy norm, 529 
Erlang pdf, 471 
error function, 103 
error probability, 410 
estimation, 272, 358 
consistent, 359 
of covariance and means, 
388-392 
expectation and introduction, 
227-304 
minimum-variance unbiased, 
359 
MMSE, 359 
multidimensional distribution, 
314 
observation vector, 359 
vector means, 388-392 
estimators, 272, 276, 352, 358, 
360 
maximum likelihood, 377 
parametric, 384 
Euclidean distance, 328 
Euclidean sample spaces, 26 
Euler’s summation formula, 45 
event probabilities 
normal approximation, 76 
events, 20-26 
exclusive-or of two sets, 22 
expectation, 227 
of a discrete RV, 229 
operator, 236 
linearity of, 255 
of an RV, 227 
of a random vector, 323-325 
expected value 
Tables of, 258 
see also moment 
exponential autocorrelation 
function, 599 
exponential pdf, 107 
exponential RV, 203 


F 

failure rates, 149-153 

failure time, 577 

feedback filter, 461 

Feller, William, 50 

Fermi-Dirac statistics, 58 

fields, 20, 25 

filtered-convolution 
back-projection, 264 

filtering of independent 
sequences, 461 

finite additivity, 458 

finite capacity buffer 

example, 582-583 

finite energy norm, 529 

finite state space, 516 

finite-state Markov chain, 516 

Fisher, Ronald Aylmer, 111 

force of mortality, see 
conditional failure rate 

Fourier transform, 125, 278-279, 
483, 487 

frequency function, 113 

frequency of occurrence measure, 
16-17 

F-test, 424-427 

function-of-a-random-variable 
(FRV) problems, 163-165 

functions of random variables, 
163-217 


G 
gamma pdf, 110 
see also Erlang pdf 
Gauss, Carl F., 101 
Gaussian 
characteristic function, 
278-293 
standard Normal, 103-107 
density, 101 
joint Gaussian, 265 
marginal, 264 
noise, 461 
pdf, 101 
random vector, 331-340 
see also Normal (Gaussian) 
Gaussian law, 314 
Gaussian random process 
defined, 574 
Gaussian random sequence, 472, 
490 
Gaussian random vector, 472 
Gauss-Markov vector random 
process, 623 
Gauss Markov random sequence 
example, 513 
generalized eigenvalue, 331 
equations, 332 
generalized likelihood ratio test 
(GLRT), 415-420, 445 
generator Markov chain, 579 
generator matrix 
Markov chain, 581 
generic linear system 
system diagram, 483 
generic two-channel LSI system, 
619 
geometric series, 57, A-3 
geometric RV, 244 
GLRT, see generalized likelihood 
ratio test (GLRT) 
goodness of fit, 429 
Gossett, W. S., 111 


H 
half-closed interval, 92 
half-open interval, 92 
half-wave rectifier 

example, 170-171 
hard clipper, 569 
hazard rate, see conditional 

failure rate 

Helstrom, Carl, 237 
Hermitian matrices, 324 
Hermitian symmetry, 469, 560 
homogeneous equation, 484-485 
hypothesis testing, 402-403, 445 


Bayesian decision theory, 
403-407 
composite hypotheses, 414-415 
F-test, 424-427 
generalized likelihood ratio 
test (GLRT), 415-420 
test for equality of means of 
two populations, 420-424 
variance of normal 
population, 428-429 
goodness of fit, 429-435 
likelihood ratio test, 408-414 
ordering, percentiles, and rank, 
435-440 
confidence interval for 
median, 440-441 
distribution-free hypothesis 
testing, 441-444 
ranking test for sameness of 
two populations, 444-445 


I 
impossible event, 22 
impulse, 113 
function, 113 
response, 487 
see also discrete-time impulse; 
Dirac delta functions 
increasing sequence, 464-465 
theorem, 464 
independence, 32-47 
definitions of, 33-34 
independent and identically 
distributed (i.i.d.), 174, 
186, 198, 201, 353 
and CLT, 292 
sum of i.i.d. binomial RVs, 283 
and LLN, 272 
independent increments, 564-565 
property 
defined, 475 
random sequence, 476, 535 
independent random sequence, 
456 
independent random process, 591 
independent random variables, 
137-138 
sum of, 189-194 
independent random vectors, 324 
indirect dependence, 611 
induction, see mathematical 
induction 
infinite intersections, 457 
infinite length Bernoulli trials, 
459-460 
infinite length queues 
birth-death process, 580 
infinite length random sequences, 
457 
infinite root transmittance 
example of, 182-183 
infinitesimal parallelepiped, 
312-313 
infinitesimal parallelogram, 209 
infinitesimal rectangle, 209 
infinitesimal rectangular 
parallelepiped, 312-313 
infinitesimal volume 
ratio of, 312 
initial rest condition, 486 
inner products, 271 
instantaneous failure rate, see 
conditional failure rate 
intensity, see mean-arrival rate 
intensity rate, see conditional 
failure rate 
interarrival times, 470 
interpolation, 509-512 
example of, 508 
interpretation 
of psd, 502, 598-599 
intersection of sets, 22 
intuitive probability, 15 
invariance property of MLE, 381 
inverse Fourier transform, 125, 
285, 487, 544, 596 
inverse image, 93 
inverse two-sided Laplace 
transform, 609, A-3 


J 
Jacobian, 334, 599 
computation, 314 
magnitude, 321 
transformation, 210 
joint characteristic functions, 
285-288 
example, 286-287 
joint densities 
of random variables, 128-146 
of random vectors, 307-310 
joint distribution, 119-149 
of random vectors, 307-310 
joint Gaussian density graph of, 
144 
joint Gaussian distribution, 264 


joint Gaussian pdf, 144, 262 
contour of constant density, 
265-267 
joint Gaussian random variables, 
263-265 
joint moments, 258-260 
defined, 258-259 
joint PMF, 310-343 
defined and conditional 
expectation, 246-247 
joint probability 
of events, 32-47 
joint stationary random 
processes, 593-612 


K 

Kalman filter, 515, 523-524 

Kolmogorov, Andrei, 14, 26, 458, 
460 


L 
Lagrange method, 257 
Laplace pdf, 108 
Laplace transform, 609 
law of large numbers, 271-272 
convergence, 533-538 
in statistics, 383 
strong law, 537 
weak laws, 533-534 
Lebesgue measure, 528 
likelihood function, 378 
likelihood ratio test, 408-414 
likelihood ratio test (LRT), 
445 
linear amplifier with cutoff 
example of, 181-183 
linear combination 
of conditional pdf, 310 
linear constant coefficient 
differential equation 
(LCCDE), 484, 608 
example, 523, 611 
linear continuous-time system 
defined, 585 
linear differential equations 
(LDEs) random processes, 
567 
linear estimation, 359 
of vector parameters, 392-396 
linearity 
expectation operator, 560 
linear operator, 483 


linear prediction 
example of, 261-262 
linear regression 
example, 261-262 
linear shift-invariant (LSI), 
486-487 
systems, 593-612 
linear systems 
with input random sequence, 
489-490 
WSS inputs 
input/output relations, 606 
linear time-invariant (LTI), 486 
see also linear shift-invariant 
(LSI) 
log-likelihood 
function, 379 
loss functions, 403 
lowpass filter 
example, 492 


M 
marginal density, 310 
marginal pdf, 323, 342 
defined, 237 
random vector, 310 
Markov, A. A., 483 
Markov chain, 516-523, 576 
asymmetric two-state 
example of, 518-519 
birth-death, 579-581 
continuous-time, 516 
discrete-time, 516 
defined, 516 
finite-state, 516 
generator matrix, 581 
Markov inequality, 269-270 
Markov-p random sequence, 
514-515 
defined, 514, 516 
example, 525 
scalar, 522-523 
Markov process 
continuous-valued, 576 
Markov random process, 
575-579 
defined, 576 
vector 
defined, 621-622 
Markov random sequence, 483, 
512-513 
continuous-valued, 512-513 
discrete-valued, 512-513, 576 
Markov state diagram 
for birth-death process, 580 
Markov vector random sequence, 
514-515 
Martingale, 534 
Martingale convergence theorem, 
536-538 
Martingale sequence 
theorem, 535-536 
MATLAB 
average number of calls, 
242-244 
integral approximation, 
142-144 
psd plotting, 600 
random sequence with 
memory, 503-504 
simulation, 462-463 
mathematical induction, 471, 
A-17 
maximum entropy (ME) 
example of, 256-258 
maximum likelihood (ML) 
principle, 377-378 
maximum-likelihood estimator 
(MLE), 377-381, 396, 414 
max operator, see supremum 
operator 
Maxwell-Boltzmann statistics, 
57 
mean and variance, simultaneous 
estimation of, 373-375 
mean-arrival rate, 564 
mean confidence interval for, 
364-366 
mean-estimator function (MEF), 
360, 361-364, 377 
mean function 
of random sequences, 558 
mean-square 
convergence, 529 
error, 253 
periodic, 612 
values, 257 
mean-square error (MSE), 253, 
261 
minimum MSE (MMSE), 359 
mean values, Tables of, 258 
measurable function, 527 
measure theory, 458 
memoryless property 
of exponential pdf, 564 
Merzbacher, Eugen, 13-14 





minimum mean-square error 
(MMSE), 359 
minimum-variance unbiased 
estimator, 359 
miscalculations 
in probability, 19-20 
misuses in probability, 19-20 
mixed random sequence, 468 
mixed random variables, 112-119 
mixture distribution function, 
310 
mixture pdf, 310 
modified trellis diagram, 519 
moment, 227, 254-267, 314 
estimator, 275 
moment generating function 
(MGF), 273 
of random sequence, 468, 490 
Tables of, 258 
Monte Carlo simulation, 292 
moving average, 301, 484, 515 
multidimensional Gaussian law, 
314, 331-340 
multidimensional Gaussian pdf, 
325 
multinomial Bernoulli trials, 
60-69 
multinomial coefficient, 53 
multinomial formula, 66-69 
exercises dealing with, 84 
multiple-parameter ML 
estimation, 379 
multiple transformation 
of random variables, 311-314 
multiplier (Product of RVs) 
example, 185-186 
multiprocessor reliability 
example, 577 


N 
Neyman, J., 111 
Neyman-Pearson theorem 
(NPT), 413-414 
noise 
atmospheric, 17 
communication channel, 41 
correlated, 460 
Gaussian noise, 165 
narrow-band, 204 
noise voltage, 102 
resistor noise, 154 
white noise, 394, 506 
non-Gaussian parameters, 
375-377 
nonindependent random 
variables 
joint densities of, 141-142 
nonlinear devices 
example, 181-182 
nonmeasurable subsets, 92 
nonnegative random variables, 
269 
nonnegative RV, 269 
nonstationary first-order Erlang 
density, 563 
nonparametric statistics, 437 
Normal approximation, 388, 
440-441 
to binomial law, 75-77 
to event probabilities, 76 
to Poisson law, 77 
see also Gaussian 
Normal law, 75 
normalized covariance, 249, 325, 
373 
normalized frequency, 488 
Normal (Gaussian) 
characteristic function, 280, 
343 
joint pdf, 215, 265 
pdf, 101 
random vector, 334 
NPT, see Neyman-Pearson 
theorem (NPT) 
numerical average, 360 


O 


observation vector 
estimator, 359 
occupancy numbers, 55 
occupancy problems, 54 
open sets 
intervals, 92 
operator L, 484 
linear, 483 
optimum linear prediction 
example, 261-262 
ordered random variables, 
314-317 
distribution of area random 
variables, 317-323 
ordered sample, 51 
ordering 
subpopulation, 51 
ordering, percentiles, and rank, 
435-440 
confidence interval for median, 
440-441 
distribution-free hypothesis 
testing, 441-444 
ranking test for sameness of 
two populations, 444-445 
orthogonal 
random processes, 590 
random vector, 324 
orthogonal random vector, 324 
orthogonal unit eigenvectors, 329 
orthonormal eigenvectors 
computation, 337 
outcomes, 15-16 
output autocorrelation function, 
494 
WSS, 594-595 
output-correlation functions 
theorem, 492-494 
output covariance 
calculating, 492 
output moment functions, 490 
output random sequence mean 
theorem, 490-492 


P 
packet switching 
example, 582 
Papoulis, Athanasios, 167 
paradoxes 
in probability, 19-20 
parallelepipeds 
union and intersection, 309 
parallel operation (maximum 
operation) 
example, 187 
parameter estimation, 352-396 
estimators, 358-360 
independent, identically 
distributed (i.i.d.) 
observations, 353-355 
linear estimation of vector 
parameters, 392-396 
maximum likelihood 
estimators, 377-381 
mean and variance, 
simultaneous estimation 
of, 373-375 
mean, estimation of, 360-361 
δ-confidence interval, 364, 
367 
mean-estimator function 
(MEF), 361-364 
normal distribution, 364-366 
median of population versus 
its mean, 383-384 
non-Gaussian parameters from 
large samples, 375-377 


parametric versus nonparametric 
statistics, 381-383, 
384-385 
confidence interval for median 
when n is large, 387-388 
confidence interval on 
percentile, 385-387 
median of population versus 
its mean, 383-384 
probabilities, estimation of, 
355-358 
variance and covariance, 
367-369 
confidence interval, 369-371 
covariance, estimating, 
372-373 
standard deviation directly, 
estimating, 371-372 
vector means and covariance 
matrices, 388-389 
μ, estimation of, 389-390 
covariance K, estimation of, 
390-392 
parametric case, 437 
parametric statistics, 384, 437 
particular solution, 485 
Pauli, Wolfgang, 58 
P-convergence, 532 
Pearson, E. S., 111 
Pearson test statistic, 431, 433 
periodic processes, 612-617 
pdf, see probability density 
function (pdf) 
phase recovery, 500 
phase-shift keying (PSK), 
560 
digital modulation, 570 
phase space, 57 
Planck’s constant, 57 
PMF, see probability mass 
function (PMF) 
points, 71 
Poisson counting process, 560, 
562-566 
Poisson law, 69-75 
compound Poisson, 125 


exercises dealing with, 85 
random variable, 114-115 
Poisson process, 562 
alternative derivation, 
567-569 
sum of two independent 
example, 566 
Poisson characteristic function, 
343 
sum 
example, 200 
Poisson rate parameter, 72 
Poisson transform, 125, 248 
population, 353, 435 
positive definite, 326 
positively correlated, 265 
positive semidefinite, 326-327, 
560 
autocorrelation functions 
property, 498 
correlation function 
theorem, 608 
power spectral density (psd), 
361, 363-365, 455-460 
correlation function properties 
table, 597 
defined, 596 
interpretation, 502, 598-608 
properties, 501 
PSK 
example, 616-618 
stationary random sequences, 
503-504 
transfer function, 605 
triangular autocorrelation 
example, 601 
white noise 
example, 598 
predicted value, 249 
prima facie evidence, 262 
probability 
axiomatic definition of, 
27-32 
estimation of, 355-358 
exercises dealing with, 78-89 
theory of, 29 
types, 12-18 
probability-1 (almost sure) 
convergence, 527-533 
probability density function 
(pdf), 100-112, 229, 308, 
516 
Bayes’ formula, 23 
probability density function 
(pdf) (Continued) 
Cauchy RV 
example, 235-236 
Chi-square, 240 
conditional, 122 
conditional expectation, 246 
linear combination, 310 
Erlang RV, 471 
exponential pdf, 107 
Gaussian, 101-102, 268, 280, 
344 
conversion, 103-105 
Gaussian marginal, 264 
joint 
conditional expectation, 246 
joint Gaussian, 212 
contour of constant density, 
266-267 
Laplacian pdf, 108 
marginal, 237, 310, 323, 342 
mixture, 310 
multidimensional Gaussian, 
325 
Normal (Gaussian) RV, 
101-102, 268, 280, 344, see 
also Gaussian 
Rayleigh pdf, 107, 204 
Rice-Nakagami, 204 
Table of, 110 
uniform pdf, 107 
univariate normal, 101, 334 
probability laws, 60-69 
exercises dealing with, 84 
probability mass function 
(PMF), 99, 112, 229, 378, 
516 
discrete convolution, 283 
Poisson counting process, 564 
Table of, 116 
probability measure continuity, 
464-466 
probability space, 26 


Q 
quantizing 
in A/D conversion, 173 
example, 173-176 
in image compression, 109 
queueing process, 579 
queue length 
finite, 581-583 
infinite, 580-581 


R 
radioactivity monitor 
example, 565-566 
random complex exponential 
example, 592 
random inputs 
continuous-time linear 
systems, 584-590 
random process, 555-623 
classifications of, 590-592 
defined, 556-560 
exercises dealing with, 
623-646 
generated from random 
sequences, 584 
random pulse sequence 
example, 531 
random sample of size n, 353 
random sequence, 453-538 
concepts, 454-483 
consistency of higher-order 
cdf’s, 467 
convergence of, 525-533 
defined, 454-455 
exercises dealing with, 538-553 
finite support 
example of, 455 
illustration of, 454 
input/output relations, 505 
linear systems and, 489-498 
random process generated, 584 
statistical specification of, 
466-483 
synthesis of, 505-508 
tree diagram of 
example, 456-457 
random telegraph signal (RTS), 
560, 569-570 
autocorrelation function of, 
599 
random variables, 91-153, 527 
definition of, 92-95 
exercises dealing with, 
153-161 
functions of, 163-217 
input/output view, 166-167 
multiple transformation of, 
311-314 
symbolic representation, 93 
random vectors 
characteristic functions of, 
330-343 
characterized as, 314 


classified as, 324 
expectation vectors and 
covariance matrices, 323-325 
functionally independent, 311 
joint densities, 307-311 
marginal pdf, 310 
random walk problem 
displacement, 185 
random walk sequence 
example, 473-475 
ranking test for sameness of two 
populations, 444-445 
Rayleigh density function, 204 
Rayleigh distribution, 204 
Rayleigh law, 339 
Rayleigh pdf, 107, 204 
realizations of random sequence, 
454 
real symmetric matrices, 324 
real-valued random process 
example of, 588-589 
theorem regarding, 607 
real-valued random variable 
example of, 592 
region of absolute convergence, 
489 
relative frequency approach, 16 
renewal process, 569 
Rice, S. O., 204 
Rice-Nakagami pdf, 204 
Rician density 
example, 204-205 
Riemann, Bernhard, 231 
Riemann sum, 231 
rotational transformer, 206 
running time-average 
example, 516 


S 

sample space, 20-21, 308 
space, 308 

sample mean, 360 

sample mean estimator (SME), 

375 

example, 272 

sample sequence, 454 
construction example, 462 
random walk, 474 

sample space, 20 
illustration, 454 

sampling 
distribution, 365 
with replacement, 51 
theory, 511 
without replacement, 51-52 
scalar Markov-p 
example, 525 
scalar product, 271 
derivative of, 392-395 
scalar random sequence, 523-524 
Schwarz, H. Amandus, 270 
Schwarz inequalities, 267-273 
second-order joint moments, 258 
semidefinite functions 
defined, 560 
separability of random process, 
590 
separable random sequences, 
558 
example, 355 
set algebra, 22 
sets, 20-26 
shift-invariant, 477, 486 
covariance function, 497 
short-time state-transition 
diagram, 578 
sigma algebra, 25 
sigma fields, 25 
simultaneous diagonalization 
two covariance matrices, 331 
sine wave, 176-178 
six-dimensional space, 57 
spectral factorization, 506 
square-law detector 
example, 169-170, 196-199 
standard deviation, 228 
standard Normal density, 75 
see also Gaussian 
standard Normal distribution, 75 
see also Gaussian 
state equations, 523-525, 
618-623 
state of the process, 576 
state-transition diagram, 479, 
578 
concept 
example, 516-517 
state-transition matrix, 516 
state-variable 
representation, 525 
stationary, 196 
processes, 608-612 
psd, 504-505 
random process 
defined, 590-591 
random sequences, 476-478 


statistically specified random 
process, 557 
statistical pattern recognition, 
331 
statistical specification 
of random sequence, 466-483 
of random process, 556 
steady state, 522 
autocorrelation function 
asymptotic stationary 
(ASA), 520 
Stieltjes integral, 468 
Stirling, James, 69 
Stirling’s formula, 69 
stochastic processes 
transformation, 584-590 
Strong Law of Large Numbers, 
528 
theorem, 537 
student-t pdf, 110 
subpopulation, 51-53 
supremum operator, 529 
superposition, summation, 486 
sure convergence 
defined, 527 
symmetric exponential 
correlation function 
RTS, 570 
system function, 489 


T 


Taylor series, 290 
temporally coherent, 148 
test for equality of means of two 
populations, 420-424 
time-variant impulse response, 
486 
total probabilities, 32-47 
transfer function 
LSI system 
example, 605 
transformation of CDFs 
example of, 172 
transition probabilities, 576 
transition time, 577 
trapping state, 523 
Trellis diagram 
Markov chain, 519 
triangular autocorrelation 
function, 601 
tri-diagonal correlation function 
diagram, 473 





two-state random sequence with 
memory 
example, 478-479 
two-variable-to-two variable 
matrixer, 206 


U 
unbiasedness, 361 
estimator, 434, 437 
unconditional CDF, 122, 583 
unconditional probability, 46 
uncorrelated 
random processes, 590 
random variables 
properties of, 260-261 
random vector, 324 
samples, 337 
sequence, 461 
uncountable, 28, 230 
uncoupled two-channel LSI 
system, 619 
uniform law, 102 
uniform pdf, 107 
uniform random number 
generators (URNG), 293 
union of sets (events), 22 
unitary 
matrices, 328 
unit-step function, 563 
univariate normal pdf, 101, 334 
universal set, 22 
upsampled (expansion), 510 


V 
variance and covariance, 367-369 
confidence interval, 369-371 
covariance, estimating, 
372-373 
standard deviation directly, 
estimating, 371-372 
Tables of, 258 
variance-estimator function 
(VEF), 360-361 
variance function, 469 
variance of normal population, 
428-429 
variation of parameters, 568 
vector convolution 
defined, 621 
vector Markov random sequence, 
524 
example, 525 
vector Markov random process 
defined, 622 
vector means and covariance 
matrices, 388-389 
μ, estimation of, 389-390 
covariance K, estimation of, 
390-392 
vector parameters 
linear estimation of, 392-395 
vector processes, 619-624 
vector random sequence, 523-525 
Venn diagram, 23 
axiomatic definition of 
probability, 27-32 
V = g(X, Y), W = h(X, Y) 
problems of type, 205-212 
Viterbi algorithm, 520 
Von Mises, Richard, 14, 16 


W 

waiting times, 470 
example, 469-470 

weak law-nonuniform variance 
theorem, 533 

weak law of large numbers, 271 
theorem, 533 


weighted average, 228 
white Gaussian random 
sequence, 525 
whitening, 329, 330 
transformation, 330 
white noise, 589, 602 
wide-sense cyclostationary 
random process 
defined, 614 
wide-sense Markov of order 14 
wide-sense periodic stationary, 
612 
wide-sense stationary (WSS), 
477-478, 559 
covariance function 
example, 477 
cross-correlation matrices, 524 
defined for, 476 
processes, 593-612 
derivative example, 596 
PSK 
example, 616 
random process 
defined, 592 
random sequences, 498-512 
defined, 498 
input/output relations, 505 
Wiener, Norbert, 572 

Wiener-Levy process, 573 

Wiener process, 560, 572-576, 
603 

Wishart distribution, 391 


Y 


Y = g(X) problems, 167-183 
general formula of 
determining, 178-179 


Z 


zero crossing 

information in, 569 
zero-input solution, 623 
zero-mean Gaussian RV, 249 
zero-mean random sequence 

example of, 507 
zero-order modified Bessel 

function, 205 
zero-state solution, 622 

Z = g(X, Y) 

solving problems of type, 

167-183 

Z-transforms, 483, 489, 496 


